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The MAILING DATE of this communication appears on the cover sheet with the correspondence address — 
Period for Reply 

A SHORTENED STATUTORY PERIOD FOR REPLY IS SET TO EXPIRE 3 MONTH(S) FROM 
THE MAILING DATE OF THIS COMMUNICATION. 

- Extensions of time may be available under the provisions of 37 CFR 1.136(a). In no event, however, may a reply be timely filed 
after SIX (6) MONTHS from the mailing date of this communication. 

- If the period for reply specified above is less than thirty (30) days, a reply within the statutory minimum of thirty (30) days will be considered timely. 

• If NO period for reply is specified above, the maximum statutory period will apply and will expire SIX (6) MONTHS from the mailing date of this communication. 

- Failure to reply within the set or extended period for reply will, by statute, cause the application to become ABANDONED (35 U.S.C. § 133). 
Any reply received by the Office later than three months after the mailing date of this communication, even if timely filed, may reduce any 
earned patent term adjustment. See 37 CFR 1.704(b). 

Status 

1)^ Responsive to communication(s) filed on 15 June 2004 . 
2a)D This action is FINAL. 2b)0 This action is non-final. 

3) Q Since this application is in condition for allowance except for formal matters, prosecution as to the merits is 

closed in accordance with the practice under Ex parte Quay/e, 1935 CD. 1 1 , 453 O.G. 213. 

Disposition of Claims 

4) ^3 Claim(s) 1-1 1 is/are pending in the application. 

4a) Of the above claim(s) 9-11 is/are withdrawn from consideration. 

5) D Claim(s) is/are allowed. 

6) H Claim(s) 1^8 is/are rejected. 

7) D Claim(s) is/are objected to. 

8) D Claim(s) are subject to restriction and/or election requirement. 

Application Papers 

9) D The specification is objected to by the Examiner. 

10) ^ The drawing(s) filed on 29 June 2001 is/are: a)^ accepted or b)D objected to by the Examiner. 

Applicant may not request that any objection to the drawing(s) be held in abeyance. See 37 CFR 1 .85(a). 
Replacement drawing sheet(s) including the correction is required if the drawing(s) is objected to. See 37 CFR 1.121(d). 

1 1) D The oath or declaration is objected to by the Examiner. Note the attached Office Action or form PTO-1 52. 

Priority under 35 U.S.C. § 119 

12) D Acknowledgment is made of a claim for foreign priority under 35 U.S.C. § 1 19(a)-(d) or (f). 
a)D All b)D Some * c)D None of: 

1 .□ Certified copies of the priority documents have been received. 

2.D Certified copies of the priority documents have been received in Application No. . 




3.Q Copies of the certified copies of the priority documents have been received in this National Stage 
application from the International Bureau (PCT Rule 17.2(a)). 
* See the attached detailed Office action for a list of the certified copies not received. 
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DETAILED ACTION 



> Applicant's election of claims 1 -8 in the reply filed on 1 5 June 2004 is 
acknowledged. Because applicant did not distinctly and specifically point 
out the supposed errors in the restriction requirement, the election has 
been treated as an election without traverse (MPEP § 818.03(a)). 



Claim Rejections - 35 USC § 102 



> The following is a quotation of the appropriate paragraphs of 35 

U.S.C. 102 that form the basis for the rejections under this section made 

in this Office action: 
A person shall be entitled to a patent unless - 

(b) the invention was patented or described in a printed publication in this or a foreign country or in public 
use or on sale in this country, more than one year prior to the date of application for patent in the United 
States. 

> Claims 1-6 and 8 are rejected under 35 U.S.C. 102(b) as being anticipated 
by Fujimoto (USPN 5,912,710). 
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Regarding claim 1, Fujimoto discloses: a color space converter that is 
configured to provide a conversion of pixel values in a first color space to corresponding 
pixel values in a second color space (Please note, figure 9, in correlation to column 1 1 , 
line 56. As indicated an RGB to YCC color space converter); a scaler that is configured 
to provide a scaling of pixel values at a first scale to corresponding pixel values at a 
second scale and a filter that is configured to apply a filter function to pixel values 
(Please note, figure 9, elements 305 and 306, in correlation to column 12, lines 52-55. 
As indicated the current scaling H filter 305 and the next scaling H filter 306 execute the 
horizontal scaling operation on the graphics data for the displaying target line and the' 
next displaying line); wherein the color space converter uses the filter to provide the v 
conversion (As shown on figure 9, the filters 305 and 306 are being utilized in the RGB 
to YCC conversion) and the scaler uses the filter to provide the scaling (As indicated in 
figure 9, elements 305 and 306 the scaling is an integrated part of the filter). 

Regarding claim 2, Fujimoto discloses, wherein a first multiplexer that 
is configured to selectively provide pixel values to the filter to selectively effect the 
conversion and the scaling (Please note, figure 9, elements 301-303, in correlation to 
column 15, lines 12-14. As indicated the Multiplexer 303 alternately selects the current 
buffer 301 and the next buffer 302 and outputs the graphics data in the selected buffer 
at the unit pixel in a serial fashion. As shown in figure 9, these pixel values are being 
provided to the filters 305 and 306 for conversion and scaling). 
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Regarding claim 3, Fujimoto discloses, wherein a second multiplexer 
that is configured to selectively provide color space conversion coefficients and scaling 
coefficients to the filter to selectively effect the conversion and the scaling (Please note, 
figure 11, multiplier 514 performs multiplexing operation). 

Regarding claim 4, Fujimoto discloses, wherein a third multiplexer 
that is configured to selectively provide offset parameters to the filter to selectively effect 
the conversion and the scaling (Please note, figure 11, multiplier 515 performs 
multiplexing operation). 

Regarding claim 5, Fujimoto discloses, wherein the filter is a FIR filter 
(Please note, figure 1 1, in correlation to column 13, line 29, indicative of a FIR filter). 1 

Regarding claim 6, Fujimoto discloses, wherein a memory that 
facilitates communication of pixel values among the color space converter, the scaler, 
and the filter (Please note, figure 9, elements 301 and 302). 

Regarding claim 8, Fujimoto discloses, wherein, the filter includes a 
multiply-add array (Please note, figure 1 1 , being a FIR filter containing three multipliers 
and an adder), and the color space converter uses the multiply-add array of the filter to 
provide the conversion (Please note, figure 1 1 and figure 9, elements 305 and 306, in 
comparing these two figures, note that the multiply-add arrays are being utilized in 
conversion of RGB to YCC), and the scaler uses the multiply-add array of the filter to 
provide the scaling (Please note, figure 1 1 and figure 9, elements 305 and 306, in 
comparing these two figures, note that the multiply-add arrays clear perform scaling). 
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Claim Rejections - 35 USC § 103 

> The following is a quotation of 35 U.S.C. 103(a) which forms the basis for 
all obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the \ 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

> Claim 7 is rejected under 35 U.S.C. 103(a) as being unpatentable over 
Fujimoto (USPN 5,912,710) in view of Landers et al. (USPN 
6,247,036 B1). 

Regarding claim 7, Fujimoto discloses: a color space 
converter that is configured to provide a conversion of pixel values in a first 
color space to corresponding pixel values in a second color space (Please 
note, figure 9, in correlation to column 1 1 , line 56. As indicated an RGB to 
YCC color space converter); a scaler that is configured to provide a scaling of 
pixel values at a first scale to corresponding pixel values at a second scale 
and a filter that is configured to apply a filter function to pixel values (Please 
note, figure 9, elements 305 and 306, in correlation to column 12, lines 52-55. 
As indicated the current scaling H filter 305 and the next scaling H filter 306 
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execute the horizontal scaling operation on the graphics data for the 
displaying target line and the next displaying line); wherein the color space 
converter uses the filter to provide the conversion (As shown on figure 9, the 
filters 305 and 306 are being utilized in the RGB to YCC conversion) and the 
scaler uses the filter to provide the scaling (As indicated in figure 9, elements 
305 and 306 the scaling is an integrated part of the filter). 

However, Fujimoto does not specifically disclose wherein, 
the filter is a 6-tap, 3-element FIR filter. 

On the other hand, Landers et al., in the same field of v 
endeavor disclose wherein, the filter is a 6-tap, 3-element FIR filter. (Please 
note, column 11, lines 37. As indicated a 6-tap filter). 

Therefore, it would have been obvious to one of ordinary 
skill in the art to utilize this 6-tap filter of Landers et al., in Fujimoto, because 
as Landers et al., on column 11, lines 39-40 explains such utilization would 
only require 3 cycles to complete the calculations, in other words, higher 
efficiency. 
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Other prior art cited 

> The prior art made of record and not relied upon is considered pertinent to 
applicant's disclosure. 

Ishihara et al. (USPN 5,448,379) is pertinent as teaching method and 
apparatus for forming color images by converting a color signal to a further color density 
signal. 

Kesatoshi et al. (USPN 5,874,937) is pertinent as teaching method and 
apparatus for scaling up and down a video image. 

Pether (USPN 6,741 ,263 B1 ) is pertinent as teaching video sampling 
structure conversion in BMME. 

Markandey et al. (USPN 6, 128,539) is pertinent as teaching method and 
apparatus for forming image scaling filters. 

Kuwata et al. (USPN 6,055,071 ) is pertinent as teaching image forming 

apparatus. 

Asaida et al. (USPN 5,521 ,637) is pertinent as teaching solid state 
image pick-up apparatus for converting the data clock rate of the generated picture data 
signals. 
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Contact Information 

> Any inquiry concerning this communication or earlier communications from the 
Examiner should be directed to Amir Alavi whose telephone number is (703) 306- 
5913. 

> The Examiner can normally be reached on Monday through Thursday from 8:00 
a.m. to 6:30 p.m. If attempts to reach the Examiner by telephone are unsuccessful, 
the Examiner's supervisor, Leo Boudreau, can be reached at (703) 305-4706. 

Any response to this action should be mailed to: 

Assistant Commissioner for Patents 

Washington, D.C. 20231 
Or faxed to: 

(703) 872-9306, ("draft" or "informal" communications should be clearly 

labeled to expedite delivery to Examiner) 
Hand delivered responses should be brought to Crystal Park II, 2121 Crystal 
Drive, Arlington, VA., Sixth Floor (Receptionist). Any inquiry of a general nature or 
relating to the status of this application should be directed to the T.C. Customer Service 
Office whose telephone number is (703) 306-0377. 
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USPTO TO PROVIDE ELECTRONIC ACCESS TO CITED U.S. 
PATENT REFERENCES WITH OFFICE ACTIONS AND CEASE 

SUPPLYING PAPER COPIES 

In support of its 21 st Century Strategic Plan goal of increased patent e-Govemment, beginning in 
June 2004 the United States Patent and Trademark Office (Office or USPTO) will begin thephase- 
in of its E-Patent Reference program and hence will: (1) provide downloading capability off the 
U S. patents and U.S. patent application publications cited in Office actions via the E-Patent 
Reference feature of the Office's Patent Application Information Retrieval (PAIR) system; and (2) 
cease mailing paper copies of U.S. patents and U.S. patent application publications with 
Office actions (in applications and during reexamination proceedings) except for citations made 
during the international stage of an international application under the Patent Cooperation Treaty 
(PCT) In order to use the new E-Patent Reference feature applicants must: (1) obtain a digital 
certificate and software from the Office; (2) obtain a customer number from the Office; and (3) 
properly associate patent applications with the customer number. Alternatively, copies ot all U.b. 
patents and patent application publications can be accessed without a digital certificate from the 
USPTO web site from the USPTO Office of Public Records, and from commercial sources. 1 he 
Office will continue the practice of supplying paper copies of foreign patent documents and non- 
patent literature with Office actions. Paper copies of cited references will continue to be provided 
by the USPTO for international applications during the international stage. 

Schedule 

June 2004 TCs 1600, 1700, 2800 and 2900 

July 2004 TCs 3600 and 3700 

August 2004 TCs 2100 and 2600 . ' 

All U S patents and U.S. patent application publications are available on the USPTO web site. 
However a simple system for downloading the cited U.S. patents and patent application 
publications has been established for applicants, called the E-Patent Reference system. As E-Patent 
Reference and Private PAIR require participating applicants to have a customer number, retrieval, 
software and a digital certificate, all applicants are strongly encouraged to contact the Patent 
Electronic Business Center to acquire these items. To be ready to use this system by ^ June 1 , 2004, 
contact the Patent EBC as soon as possible by phone.at 866-217-9197 (toll-free), 703-305-3028 or 
703-308-6845 or electronically via the Internet at ebc(2),uspto.gov. 

Other Options 

The E-Patent Reference function requires the applicant to use the secure Private PAIR system 
which establishes confidential communications with the applicant. Applicants using this facility 
must receive a digital certificate, as described above. Other options for obtaining patents which do 
• not require the digital certificate include the USPTO's free Patents on the Web program _ 
(http //www uspto.gov/patft/index.html). The USPTO's Office of Public Records also supplies 
copies of patents for a fee rhti p://ehizl.uspto. e ov/oems25 P /ind ex.htmn. Commercial sources also 
provide U.S. patents and patent application publications. 

USPTO web site. 



NOTICE OF OFFICE PLAN TO CEASE SUPPLYING COPIES OF CITED U.S. PATENT 
REFERENCES WITH OFFICE ACTIONS, AND PILOT TO EVALUATE THE 
ALTERNATIVE OF PROVIDING ELECTRONIC ACCESS TO SUCH U.S. PATENT 

REFERENCES 



Summary 

The United States Patent and Trademark Office (Office or USPTO) plans in the near future to: 

(1) cease mailing copies of U.S. patents and U.S. patent application publications (US patent 
references) with Office actions except for citations made during the international stage of an 
international application under the Patent Cooperation Treaty and those made during 
reexamination proceedings; and (2) provide electronic access to, with convenient downloading 
capability of, the US patent references cited in an Office action via the Office's private Patent 
Application Information Retrieval (PAIR) system which has a new feature called "E-Patent 
Reference." Before ceasing to provide copies of U.S. patent references with Office actions, the 
Office shall test the feasibility of the E-Patent Reference feature by conducting a two-month pilot 
project starting with Office actions mailed after December 1, 2003. The Office shall evaluate the 
pilot project and publish the results in a notice which will be posted on the Office's web site 
(www USPTO gov) and in the Patent Official Gazette (O.G.). In order to use the new E-Patent 
Reference feature during the pilot period, or when the Office ceases to send copies of U.S. patent 
references with Office actions, the applicant must: (1) obtain a digital certificate from the Office; 

(2) obtain a customer number from the Office, and (3) properly associate applications with the 
customer number. The pilot project does not involve or affect the current Office practice of 
supplying paper copies of foreign patent documents and non-patent literature with Office actions. 
Paper copies of references will continue to be provided by the USPTO for searches and written 
opinions prepared by the USPTO for international applications during the international stage and 
for reexamination proceedings. 

Description of Pilot Project to Provide Electronic Access to Cited U.S. Patent 
References 

On December 1, 2003, the Office will make available a new feature, E-Patent Reference in the 
Office's private PAIR system, to allow more convenient downloading of U.S. patents andU.S. 
patent application publications. The new feature will allow an authorized user of private PAIR 
to download some or all of the U.S. patents and U.S. patent application publications cited by an 
examiner on form PTO-892 in Office actions, as well as U.S. patents and U.S. patent application 
publications submitted by applicants on form PTO/SB08 (1449) as part of an IDS. The retrieval 
of some or all of the documents may be performed in one downloading step with the documents 
encoded as Adobe Portable Document format Cpdf) files, which is an improvement over the 
current page-by-pagc retrieval capability from other USPTO systems. 



Steps to Use the New E-Patent Reference Feature During the Pilot Project and 
Thereafter 

Access to private PAIR is required to utilize E-Patent Reference, If you don't already have 
access to private PAIR, the Office urges practitioners, and applicants not represented by a 
practitioner, to take advantage of the transition period to obtain a no-cost USPTO Public Key 
Infrastructure (PKI) digital certificate, obtain a USPTO customer number, associate all of their 
pending and new application filings with their customer number, install no-cost software 
(supplied by the Office) required to access private PAIR and E-Patent Reference feature, and 
make appropriate arrangements for Internet access. The fall instructions for obtaining a PKI 
digital certificate are available at the Office's Electronic Business Center (EBC) web page at: 
<http://www.usoto.gov/ebc/downloads.html> . Note that a notarized signature will be required to 
obtain a digital certificate. 

To get a Customer Number, download and complete the Customer Number Request form, PTO- 
SB125, at: http://www,uspto.gov/web/forms/sb0125.pdf . The completed form can then be 
transmitted by facsimile to the Electronic Business Center at (703) 308-2840, or mailed to the 
address on the form. If you arc a registered attorney or patent agent, then your registration 
number must be associated with your customer number. This is accomplished by adding your 
registration number to the Customer Number Request form. A description of associating a 
customer number with an application is described at the EBC web page at: 
http://www.uspto t gov/ebc/registration pair.html . 

The E-Patent Reference feature will be accessed using a new button on the private PAIR screen. 
Ordinarily all of the cited US. patent and U.S. patent application publication references will be 
available over the Internet using the Office's new E-Patent Reference feature. Thcsize of the 
references to be downloaded will be displayed by E-Patent Reference so the download time can 
be estimated. Applicants and registered practitioners can select to download all of the references 
or any combination of cited references. Selected references will be downloaded as complete 
documents as Adobe Portable Document Format (.pdf) files. For a limited period of time, the 
USPTO will include a copy of this notice with Office actions to encourage applicants to use this 
new feature and, if needed, to take the steps outlined above in order to be able to utilize this new 
feature during the pilot and thereafter. 

During the two-month pilot, the Office will evaluate the stability and capacity of the E-Patent 
Reference feature to reliably provide electronic access to cited U.S. patent and U.S. patent 
application publication references. While copies of U.S. patent and U.S. patent application 
publication references cited by examiners will continue to be mailed with Office actions during 
the pilot project, applicants are encouraged to use the private PAIR and the E-Patent Reference 
feature to electronically access and download cited U.S. patent and U.S. patent application 
publication references so the Office will be able to objectively evaluate its performance. The 
public is encouraged to submit comments to the Office on the usability and performance of the 
E-Patent Reference feature during the pilot. Further, during the pilot period registered 
practitioners, and applicants not represented by a practitioner, are encouraged to experiment with 
the feature, develop a proficiency in using the feature, and establish new internal processes for 
using the new access to the cited U.S. patents and U.S. patent application publications to prepare 
for the anticipated cessation of the current Office practice of supplying copies of such cited 
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references. The Office plans to continue to provide access to the E-Patent Reference feature 
during its evaluation of the pilot. 



Comments 

Comments concerning the E-Patent Reference feature should be in writing and directed to the 
Electronic Business Center (BBC) at the USPTO by electronic mail at eRefaracefguspjogov or 
by facsimile to (703) 308-2840. Comments will be posted and made available for public 
inspection. To ensure that comments are considered in the evaluation of the pilot project, 
comments should be submitted in writing by January 15, 2004. 

Comments with respect to specific applications should be sent to the Technology Centers' 
customer service centers. Comments concerning digital certificates, customer numbers, and 
associating customer numbers with applications should be sent to the Electronic Business Center 
(EBC) at the USPTO by facsimile at (703) 308-2840 or by e-mail at EBC@uspto.gov. 

Implementation after Pilot 

After the pilot, its evaluation, and publication of a subsequent notice as indicated above, the 
Office expects to implement its plan to cease mailing paper copies of U.S. patent references cited 
during examination of non provisional applications on or after February 2, 2004; although copies 
of cited foreign patent documents, as well as non-patent literature, will still be mailed to the 
applicant until such time as substantially all applications have been scanned into IFW. 

For Further Information Contact 

Technical information on the operation of the IFW system can be found on the USPTO website 
at http://www.uspto.gov/web/patents/irw/index.html. Comments concerning the E-Patent 
Reference feature and questions concerning the operation of the PAIR system should be directed 
to the EBC at the USPTO at (866) 217-9197. The EBC may also be contacted by facsimile at 
(703) 308-2840 or by e-mail at EBC@uspto.gov. 





Nicholas P. Godici 
Commissioner for Patents 
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ABSTRACT 



A system and method for displaying graphics data having a 
pixel aspect ratio on a television monitor with improved 
quality including initially up scaling the resolution of the 
graphics data in a horizontal direction so as to keep the 
original pixel aspect ratio of the graphics data constant. The 
graphics data and the television monitor having a different 
aspect ratios. The graphics data are scaled in the horizon Lai 
direction so as to coincide with a horizontal resolution of the 
television monitor. The system improves display of graphics 
images on television monitors while using simplified algo- 
rithms and hardware. 

16 Claims, 17 Drawing Sheets 
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SYSTEM AND METHOD FOR 
CONTROLLING A DISPLAY OF GRAPHICS 

DAJA PIXELS ON A VIDEO MONITOR 
HAVING A DIFFERENT DISPLAY ASPECT 
RATIO THAN THE PIXEL ASPECT RATIO 5 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to a system and method for 10 
controlling a display of graphics data having a pixel aspect 
ratio on a video monitor having a different display aspect 
ratio than the pixel aspect ratio. In particular, the present 
invention relates to a system and method for controlling a 
display of mixed images of the graphics data and motion 15 
picture data on a television monitor with a canceling of the 
difference between the aspect ratios. 

In addition, the present invention relates to a system and 
method for controlling a display of images of non-interlaced 
graphics data on an interlaced video monitor by canceling a 20 
difference between respective aspect ratios. 

Furthermore, the present invention relates to a system and 
method for controlling a display of graphics data having a 
particular pixel aspect ratio on a video monitor having a 
different display aspect ratio without any distortion of the 25 
graphics data pixels. 

2. Discussion of the Background 

In recent years, in accordance with development of com- 
puter technologies, a variety of digital devices, such as ^ 
digital video players or set-top-boxes, have been developed 
for use with an in-home television (TV) receiver used as a 
monitor for displaying computer graphics data, video etc. 
These type of digital devices control reproduction and 
display of mixed data including computer graphics data and 35 
motion picture data stored on a media. The mixed data is 
typically stored on the media as compressed data using 
digital compression coding. An optical disk, such as a 
compact disk (CD) or a digital versatile disk (DVD), are 
" typically used as the data storing media. ^ 

The DVD utilizes a new standard for coding of a video 
information, such as a movie, in order to store a large 
amount of data on a small disk space. For example, a DVD 
coded full-length movie could be stored on a DVD the size 
of a standard CD (typically 640 MB) with a high quality. The 45 
new coding standard for the coding of DVD video informa- 
tion is referred to as motion picture experts group 2 
(MPEG2) coding. For storing large amounts of high quality 
image data on the DVD disk, the MPEG2 coding for the 
DVD basically utilizes a variable rate coding technique for 50 
recording and reproducing the video data. 

The amount of image data that can be stored on a DVD 
using the variable rate coding is dependent upon character- 
istics of the image data. Image data that is made up of scenes 
having a lot of motion require a greater amount of storage 55 
than scenes with lesser motion. The stored motion picture 
data in the DVD are basically reproduced using image 
signals of a particular television standard, such as National 
Television System Committee (NTSC) standard, Phase 
Alternation by Line (PAL), standard, etc. Coasequently, the 50 
video data reproduced from a MPEG2 decoder are provided 
as non-interlaced picture signals to a video monitor. 

In addition, recent developments in computer technology 
have made it possible to increase operational speeds of 
central processing units (CPUs) and graphics controllers. 65 
Accordingly, a higher grade of graphics data, such as three 
dimensional graphics, can be displayed on a television 
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monitor by using a digital video player or a set-top box. That 
is, the users of the set-top box can interactively change the 
displayed images on the video monitor by varying the 
graphics data in various forms. In this way, a display of 
mixed high grade images comprising graphics data and 
motion picture data is now possible. 

However, it is typically not possible to reproduce mixed 
images of graphics data and video data on a television 
monitor with improved graphics data quality. Displaying 
non-interlaced computer graphics data on an interlaced 
television monitor, results in the following problems: 

(1) Differences between graphics data pixel aspect ratio 
and video monitor display aspect ratio: 

Computer graphics data usually have a pixel aspect ratio 
of 1 to 1 (1:1). In contrast, a video or television monitor has 
a different display aspect ratio than that of the graphics data 
pixel. The television monitor usually has a display ratio that 
is wider in a horizontal direction than in a vertical direction 
(e.g., 16:9 or 4:3). As result of this, when the graphics data 
are displayed on the television monitor, without 
modification, the reproduced graphics data are displayed in 
a distorted form rather than in the original form. For 
example, if a circle comprises the graphics data having a 
given pixel aspect ratio for display on a non-interlaced 
computer monitor, when the circle is reproduced on an 
interlaced video or television monitor having a horizontally 
wider aspect ratio, the circle is displayed as a vertically 
elongated ellipse on the television monitor. 

(2) Differences of signal bandwidths of interlaced and 
non-interlaced monitors: 

A television monitor utilizes an interlaced scanning dis- 
play or monitor. On the contrary, a computer uses a non- 
interlaced scanning display. The graphics data displayed on 
the non-interlaced scanning monitor have wider bandwidths 
for brightness and color signals than the bandwidths for 
brightness and color signals for a motion picture (video) 
data. These bandwidth difference of brightness and color 
signals between the graphics data and the motion picture 
data, result in flickering of the reproduced computer graph- 
ics data on the television monitor. 

SUMMARY OF THE INVENTION 

Accordingly, it is an object of the present invention to 
solve the aforementioned problems and defects of displaying 
graphics data having a certain pixel aspect ratio on a 
television monitor having a different aspect ratio with 
improved quality. 

It is another object of the present invention to provide a 
system and method for controlling a display of graphics data 
having a certain pixel aspect ratio on a television monitor 
having a different aspect ratio by effectively canceling the 
difference between the aspect ratios in a simplified pipeline 
operation of scaling and filtering the graphics data. 

It is a further object of the present invention to provide a 
system and method for controlling a display of mixed 
images of graphics data and motion picture data on a video 
monitor having a different aspect ratio than a pixel aspect 
ratio of the graphics data with improved quality. 

It is a still further object of the present invention to 
provide an system and method for canceling the differences 
between the pixel aspect ratio of graphics data and a display 
aspect ratio of a video monitor by scaling in one direction. 

It is a still further object of the present invention to 
provide an apparatus and method for controlling a display of 
mixed images of graphics and video data on a video monitor 
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by an effective pipeline operation of scaling and filtering the 
graphics data without using large expensive video memory. 

It is a still further object of the present invention to 
provide an system and method for simplifying address 
calculation for displaying graphics data on a video monitor 5 
by an efficient access operation to a video memory. 

It is a still further object of the present invention to 
provide a system and method for displaying combined 
images comprised of graphics data and motion picture data 
with improved quality by efficiently canceling a differences 10 
between pixels of the graphics data and the motion picture 
data. 

It is a still further object of the present invention to 
provide a system and method for controlling interactively a ^ 
display of combined images of graphics data and motion 
picture data stored on a DVD on a television monitor with 
improved quality. 

These and other objects are achieved according to the 
present invention by providing a system for displaying ^ 
images constructed by blending motion picture data and 
graphics data on a video monitor comprising, a disk for 
storing the motion picture data having a predetermined 
horizontal and vertical resolutions correspond to a display 
aspect ratio of the video monitor and the graphics data ^ 
having a pixel aspect ratio and having a predetermined 
vertical resolution by scaling up or down the horizontal 
resolution of the motion picture data, and an apparatus for 
controlling image displays on the video monitor by reading 
the data stored in the disk and by generating interlaced ^ 
image signals of the blended data of the motion picture data 
and the graphics data for providing to the video monitor, 
wherein, the image display control apparatus comprising, a 
memory for storing the graphics data read out from the disk, 
a color converting controller for providing color data for the 35 
graphics data by reading out the data for each set of 
predetermined number of display lines from the memory so 
as to generate the same color data for the motion picture 
data, a filter for vertically filtering the converted color data 
for the graphics data with each predetermined number of ^ 
pixels in each of the display lines, a first scalar for scaling 
up or down the pixel aspect ratio of the filtered graphics data 
correspond to the display aspect ratio of the video monitor, 
a decoder for decoding the motion picture data read out from 
the disk under the display aspect ratio of the video monitor, 45 
a second scalar for scaling down the decoded motion picture 
data correspond to a video window size displayed on the 
video monitor, a blending circuit for blending the scaled 
graphics data from the first scalar and the scaled motion 
picture data from the second scalar for every pixels of the 5Q 
blended image, and an encoder for generating interlaced 
image signals for the video monitor by encoding the blended 
data from the blending circuit. 

Furthermore, the system according to the present inven- 
tion is characterized in that tbe system utilizes horizontally 55 
enlarged graphics data by enlarging horizontal resolution of 
the data from a viewpoint that the pixel aspect ratio for video 
signals, such as television image signals, become longer in 
a horizontal direction when they are displayed on a video 
monitor of 16:9 display aspect ratio. ^ 

By scaling down the preliminary enlarged graphics data in 
a horizontal direction, it becomes possible to cancel the 
differences between the pixel aspect ratios for graphics data 
and video signals on the video monitor. 

Furthermore, the system according to the present inven- 65 
tion is characterized in that the horizontal resolution of the 
graphics data is enlarged by determining to a beforehand 
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value that is selected among tbe multiples of 8 or 16 and that 
is closest to a value obtained by multiplying the horizontal 
resolution for the video signal with the pixel aspect ratio of 
the video signal. 

BRIEF DESCRIPTION OF THE DRAWINGS 

A more complete appreciation of the present invention 
and many of the attendant advantages thereof will be readily 
obtained as the same becomes belter understood by refer- 
ence to the following detailed description, when considered 
in connection with the accompanying drawings, wherein: 

FIG. 1 is a block diagram of an image display control 
system according to the present invention; 

FIG. 2 is a sample image, displayed according to the 
system shown in FIG. 1; 

FIG. 3 is another sample image, displayed according to 
the system shown in FIG. 1; 

FIG. 4 illustrates a relationship between a resolution of 
DVD video data used in the system in FIG. 1 and a pixel 
aspect ratio displayed on a television monitor having a 
display aspect ratio of 16:9; 

FIG. 5 shows resolution of graphics data used for the 
system shown in FIG. 1; 

FIG. 6 is used to illustrate relationships between resolu- 
tions for graphics data and video data for producing com- 
bined image data; 

FIG. 7 is used to illustrate a graphics scaling method for 
displaying images constructed by combining DVD video 
data and graphics data on a television having a display 
aspect ratio of 16:9 in the system shown in FIG. 1;. 

FIG. 8 is a block diagram showing a preferable embodi- 
ment of an image display control apparatus according to the 
present invention; 

FIG. 9 is a block diagram of a graphics/video mixer used 
in the image display control apparatus shown in FIG. 1; 

FIG. 10 illustrates a color pallette provided in the image 
display controller shown in FIG. 1; 

FIG. 11 illustrates a horizontal filtering circuit provided in 
the image display controller shown in FIG. 1; 

FIG. 12 is used to illustrate a vertical filtering circuit 
provided in the image display controller shown in FIG. 1; 

FIG. 13 is used to illustrate a vertical filtering operation 
applied in the image display controller shown in FIG. 1; 

FIG. 14 is used to illustrate software structure for setting 
a scaling parameter in the control register provided in tbe 
image display control apparatus shown in FIG. 1; 

FIG. 15 is a flow chart for illustrating the steps for 
graphics scaling applied in the image display control appa- 
ratus shown in FIG. 1; 

FIG. 16 is a block diagram illustrating another embodi- 
ment of the graphics/video mixer provided in the image 
display control apparatus shown in FIG. I; 

FIG. 17 shows another embodiment of a color palette 
provided in the graphics/video mixer shown in FIG. 16; 

FIG. 18 is used to illustrate sequences for dynamically 
changing filtering characteristics of horizontal/vertical fil- 
tering circuits provided in the graphics/video mixer shown 
in FIG. 16; and 

FIG. 19 is a block diagram illustrating another embodi- 
ment of a graphics/video mixer provided in the image 
display control apparatus shown in FIG. 1. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

Referring now to the drawings, wherein like reference 
numerals designate identical or corresponding parts 
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throughout the several views, and more particularly to FIG. 
1 thereof, there is illustrated a system block diagram includ- 
ing an image display control apparatus 300 for displaying 
blended images on a television monitor 200 by combining 
graphics data 100G and motion picture or video data 100B 5 
stored on a DVD media 100 according to the present 
invention. 

In FIG. 1, the image display control apparatus 300 is 
typically referred to as a digital video player or a set-top box. 
The apparatus 300 reads out graphics data 100G and video 1Q 
data 100B (also referred to as "motion picture data") 
recorded on the DVD media 100 and generates image 
signals for displaying the blended images comprising the 
graphics data and motion picture data on the television 
monitor 200 which has a particular display aspect ratio. 

The image display control apparatus 300, as shown in 
FIG. 1, is comprised of a DVD ROM drive 101, a MPEG2 
decoder 102, a video memory, such as a volatile random 
access memory (VRAM) 103, a color data controller 104 
that includes a RGB color palette 104a and a color space 
converter 1046, a display controller 155, that includes a 
filtering circuit 105 and a first scalar 106, for changing pixel 
aspect ratio of the graphics data 100G, a second scalar 107 
for changing a size of the motion picture data 100B so that 
it fits in a video window of a given size on the monitor 200, 
a a-blending circuit 108 and an NTSC/PAL encoder 109 for 25 
providing image signals to the television monitor 200. 

Hie DVD-ROM driver 101 reads data (DVD video), such 
as the video data 100B and the graphics data 100G that are 
stored in the DVD media 100, such as an optical disk. The ^ 
graphics data 100G are used, for example, as a background 
image for the motion picture or video data 100B. The 
graphics data 100G are written into the video memory 103 
by a central processing unit (CPU, not shown in FIG. 1) 
executing an application program for reproducing a DVD 35 
title in the image display control apparatus 300. 

There are two types of image sources for storing as the 
video data 100B in the DVD media 100. One image source 
corresponds to a standard television monitor having a dis- 
play aspect ratio of 4:3. Another image source corresponds ^ 
a wide type television monitor, such as a high-definition 
television (HD-TV) monitor or an extended-definition tele- 
vision (ED-TV) monitor having a display aspect ratio of 
16:9. In either case, the resolution (a horizontal resolutionxa 
vertical resolution) is 720x450. 45 

The video data 100B includes main picture data (video), 
sub-picture data and audio. The video data 100B are coded 
by using MPEG2 coding, and the sub-picture data and the 
audio data are coded by using run-length coding and 
DOLBY AC3, respectively. Under the MPEG2 standard, it 50 
is possible to include another type of coding data in the 
MPEG2 coding data. However, the combined data having 
different coding are treated as a single MPEG2 program 
stream. Accordingly, it is possible to include both a plurality 
of sub-picture data of up to sixteen (16) channels and a 55 
plurality audio data of up to eight (8) channels within the 
main MPEG2 coded video data 100B. 

As explained before, it is possible to vary the recording/ 
reproducing amount per unit time of the MPEG2 coding data 
depending upon the transfer rate of the MPEG2 stream, 60 
because the MPEG 2 coding is basically using a variable-rate 
coding. Accordingly, it is possible to increase the transfer 
rate of the MPEG2 stream corresponding to the motion of 
the displayed scene for reproducing the motion picture data 
with high quality. 65 

There are also two resolutions for storing the graphics 
data 100G in the DVD media 100. One resolution 
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(horizontal resolutionxvertical resolution) is 640x480 for 
displaying on a standard television monitor having a display 
aspect ratio of 4:3. Another resolution is 848x480 for 
displaying on a wide type television, such as a HD or ED 
television having a display aspect ratio of 16:9. In either 
cases, the pixel aspect ratio is 1:1. 

One feature of the present invention, is to preset the 
resolution of the graphics data to 848x480 to correspond to 
a display aspect ratio of 16:9. This feature allows simplify- 
ing a scaling operation for displaying the graphics data on a 
television monitor so as to keep the pixel aspect ratio of the 
graphics data to 1:1. That is, the graphics data can be 
displayed on a television monitor with high quality, by 
preparing, in advance, an enlarged resolution of the graphics 
data in a horizontal direction. 

The MPEG2 program stream read out from the DVD 
media 100 (DVD video) is provided to a MPEG2 decoder 
102. In the MPEG2 decoder 102, the MPEG2 program 
stream is separated into the video data, the sub-picture data 
and the audio data, and the respective data are decoded and 
synchronized. After combining the decoded data of the 
video and the sub-picture data, the combined motion picture 
data are supplied to the second scalar 107. 

The scalar 107 is used to scale down the size of the motion 
picture data (i.e, 720x480), for adjusting the data to fit on 
video window which is smaller than the motion picture data 
size (e.g., less than 720x480). 

The a-blending circuit 108 constructs pixels of picture 
elements by blending the motion picture data from the scalar 
107 and the graphics data from the scalar 106. The con- 
struction ratio between the motion picture data and the 
graphics data is dependent on the value of "a". The a value 
is a parameter for indicating a transmission degree of the 
graphics data in the respective pixels. The transmission ratio 
for the motion picture data in each pixels is represented as 
(1-a). That is, in the case of a=l, only the graphics data are 
displayed, and the motion picture data are not displayed. On 
the other hand, if a«0 only, the motion picture data are 
displayed, and the graphics data are not displayed. 

The television encoder 109 converts the constructed 
image data from the a-blending circuit 108 into interlaced 
image signals under a particular television system standard, 
such as the National Television System Committee (NTSC) 
standard, Phase Alternation by Line (PAL), standard, etc., 
and supplies the television signals of the particular system 
standard to the television monitor 200 for displaying the 
combined images in an interlaced scanning mode. 

The television monitor 200 can typically support two 
types of display aspect ratios. One is a standard mode for 
displaying images having a display aspect ratio of 4:3. And 
the other is a wide screen, for wide type television, such as 
a high-definition television (HD-TV) or an extended- 
definition television (ED-TV), etc. having a display aspect 
ratio of 16:9. A user can select between the two display 
aspect ratios for displaying the images by operating, for 
example, a remote controller. The display aspect ratio is 
changed in the television monitor or receiver when the user 
selects a particular display mode. 

The data processing is executed as follows. As explained 
before, the graphics data 100G read out from the DVD 
media 100 are stored in the video memory (VRAM) 103 
under CPU software execution via the DVD drive 101. The 
graphics data for a predetermined number of display lines 
are read out from the VRAM 103 and supplied to the color 
data controller 104. The color data controller 104 comprises 
an RGB color palette circuit 104a and a color space con- 
verter 104& 



720xjr:480xr-16S (1) 
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For example, during a period of displaying odd number ground graphics, and screen operation buttons 27. In the 

fields, a particular unit of graphics data for displaying a video window 21, an object 24 is displayed as a natural 

particular odd number line and a couple of even number motion picture image. In FIG. 2, a sample object 24 (i.e., 

lines, having the odd number line in between, are alternately subject of the video) is a dolphin and the image data 26 are 

read out from the VRAM 109 in a time sharing fashion. The 5 seaweeds. 

particular unit of graphics data for the particular three lines FIG. 3 illustrates sample displays of the blended image 

are supplied to the RGB color palette circuit 104a succes- comprising motion picture data and graphics data, shown in 

sively for each of the pixels, FIG. 2, in a full screen mode. The motion picture image of 

The RGB color palette circuit 104a converts the pixel data the sample object 24 (e.g., the dolphin), along with the 

to RGB color data. For example, when one pixel of the 10 graphics data images including the title 23, attributes 25, 

graphics data is comprised of an index color mode having image data 26 and operation buttons 27 are displayed in full 

eight bits/pixel, the index color data are converted to a color screen mode. 

data of twenty-four bits for the respective colors of R (red), FIGS. 4 and 5 are used to define the graphics data 
G (green) and B (blue). When the graphics data stored in the according to the present invention. FIG. 4 illustrates a 
VRAM 103 are comprised of RGB color data having more ^ relationship between the resolutions of the DVD video data 
than sixteen bits/pixel, the RGB color palette circuit 104a is 0 f 720x480 and the pixel aspect ratio in the case of display- 
used for correcting the respective RGB colors. i C j DVD data on a television monitor having a display 

The color space converter 1046 converts the RGB color aspect ratio of 16:9. In this case, the pixel aspect ratio of the 

data from the color palette circuit 104a to YCrCb television DVD video data (X:Y) is determined by the following 

standard data. Namely, the color space converter 1046 20 equations: 
converts the RGB color data so as to be the same as the 
motion picture data (DVD video). 

The filtering circuit 105 reduces the brightness and the X:Y-ykm (2) 

color signal bandwidth of the graphics data to adjust the . . ....... 

image signals for the television monitor 200. That is, a unit 25 Accordingly, the television monitor having the display 

of graphics data that includes a particular number of pixels ratio of 16:9 035 ^ obloQ & P* el as P ect ratM> ( LC - 

for each of the three lines are filtered in a horizontal 32:27). 

direction RG * 5 B used to ^ ustrate me necessary resolutions 

The first scalar 106 converts the pixel aspect ratio of x (honzontalxwrtical) for displaying t ^apUcs ^U havu.g 

- l4 . ■ . j » . j » j ■ i 30 a pixel aspect ratio of 1:1 on a television monitor having a 

filtered graphics data so as to correspond to the display , r _.■ c 1£ n . . ; t 

- *• **l . i •» ™k vr i c ♦ display aspect ratio of 16:9, while keeping the pixel aspect 

aspect ratio of the television monitor 200. Namely, the first J \ 1 \ . _ . # , . \?~Zz . 

1JU . . . . . 1 . c ratio of 1:1. In FIG. 5, since the television monitor, having 

scalar 106 scales up or scales down the pixel aspect ratio of . ,. . ^ . r 1<c n . . , . , "73 

.i. ** j # f u. c it. i* if • /i*4- the display aspect ratio of 16:9, has an oblong pixel aspect 

the unit data of each of the lines, in a horizontal direction. . ~T /."v™ » & r r 

™ .. . ... - ~* . . . ... ratio, the pixel aspect ration of 1:1 needs to be up scaled up 

The scaling operation of the first scaler 106 is determined by 1S . ' . r . 7T , A - ^- „wk fZ. 

c i - rvi/r\ .*i i j * in the horizontal resolution direction (vertical size) for 

a program for reproducing a DVD title or a display driver . . A . . v ™- _ ^ . 

, *\ Z \ -.u j *- r*u i *y L.^ displaying on the monitor. As shown in FIG. 5, the final 

(not shown) wuh consideraUon of *e .elataonsh.ps between ^ioL are set to, for example, 848x480. The value 848 

the r^orutionsof the graphics data WOGand the display for ^ horiMntd ^ luUon (ve^ ^ of ^ 

aspect ratio of the television monitor 200. The resolutions of . , . . _ , \ . \. . a 

. . 1ftA/1 . , t r . , selected as the closest value to the value that was obtained 

the graphics data 100G can be read out from a header ^ . . . . . nAr . , t . . . . . r t . . 

• c Kz e *u u* j . 1AAr j-j .» * 40 by multiplying 740 (the horizontal resolution for the video 

information for the graphics data 100G. Further, the display ' v , * ... v .... , ri , , «• 

r L i • • i .j data) by 32/27 within multiple values of 16. In practice, the 

aspect ratio of the television monitor can be recognized 7 / f , r , . . . 

l 4 i c ■ mn v _ 4 following values of resolutions are used as the graphics data 

when the program for reproducing a DVD tide requests a . .■_.«■■ . . 

C _r j* i .* . A / . for displaying on the television monitor having an aspect 

user to select a display aspect ratio or when the image ra tfoof 16*9* 

display control apparatus 300 communicates automatically 45 

to the television 200 via an interface therebetween. (v 848x480; 

As explained above, the filtering circuit 105 and the first ( 2 ) 868x480; and 

scaler 106 are provided for converting the graphics data (3) 832x480. 

100G intended for non-interlaced display and stored in the However, instead of using multiples of 16, it is possible 

VRAM 103 for display on interlaced television monitor 200 50 to use multiples of 8. By using multiples of 8 or 16 as the 

with improved quality. As explained before, the scaled horizontal size of the graphics data, simplification of address 

graphics data through the first scaler 106 are supplied to the calculation for readmg/writiog of the graphics data from the 

a-blcnding circuit 108, and blended with the motion picture video memory (VRAM) 103 can be achieved. The example 

data from the second scalier 107. Then the blended image values of the resolution for the graphics data are calculated 

data are displayed on the monitor 200 through the television 55 for a DVD video source of the NTSC standard. In the case 

encoder 109. °f mat ^ c D VD video source uses the PAL standard system, 

FIG. 2 illustrates sample displays of a blended image the appropriate values of resolutions are selected in order to 

comprising motion picture data and graphics data. In FIG. 2, accommodate that system. 

in a screen 20 are displayed the motion picture data dis- ^ , . 15 ^ 10 J lUuslratc a^^phics scaling memod for 
played in a video window 21 and graphics data 22. The 60 ^isplaymg combined images of DVD video data and graph- 
graphics data 22, for example, the contents of an electronic l <? ^ on a television monitor having a display aspect ratio 
encyclopedia, are displayed as a background image for the of 16 - 9 - explained before, the preferred embodiment is 
video window 21. Tnat is, the graphics data, displayed as the i0 use the following two types of graphics data: 
background image, may comprise a title 23 of the subject of (1) graphics data for the display aspect ratio of 16:9 using 
the video window 21, attribute information 25, such as 65 848x480 resolution; and 

descriptive information of the subject of the video, image (2) graphics data for the display aspect ratio of 4:3 using 

data 26 suitable for the video window 21, such as back- 640x480 resolution. 
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In case of displaying the graphics (resolutions of 848 x in the a-blending circuit 108 and ihe combined data 73 are 

480) for the display aspect ratio of 16:9, the graphics data are displayed on the television monitor 200 having a display 

horizontally scaled down by the first scalar 106 in FIG. 1 and aspect ratio of 4:3. 

are converted to the resolutions of 720x480 so as to be the As explained above, according to the present invention, 

same as the DVD video data. Then the scaled graphics data 5 the graphics data having a pixel ratio of 1:1 are converted 

are blended with the video data from the MPEG2 decoder and displayed with horizontal scaling only. Consequently, 

102 in the a-blending circuit 108 for displaying on the the system and apparatus for controlling the combined 

television monitor 200 having a display aspect ratio of 16:9. displays do not need to include hardware for scaling in a 

By first enlarging the graphics data in a horizontal direc- vertical direction. Accordingly, the cost of the system and 

tion rather than the video data, it becomes possible to cancel 10 apparatus is reduced. 

the differences of the pixel aspect ratios between the graph- FIG. 8 is a detailed block diagram of the image display 

ics data and the video data by simply scaling down the pixels control apparatus 300 shown in FIG. 1. The image display 

for the graphics data in a horizontal direction. This is control apparatus 300 can reproduce a DVD title from the 

because the graphics data are enlarged in a horizontal DVD media 100 by using the DVD-ROM driver 101 and 

direction when they are displayed on the video monitor is also can execute various applications, such as a game 

having the display aspect ratio of 16:9. software or a viewer. 

In the case of displaying the video data supplied from the Before constructing the motion picture data in the 

MPEG2 decoder in the video window 21 as shown in FIG. a-blending circuit 108, the graphics data are scaled to 

2, the second scaler 107 in FIG. 1 controls the change in the change the pixel aspect ratio and are vertically filtered to 

resolutions for the video data. 20 reduce brightness and color signal bandwidth so as to adjust 

In the case of displaying the graphics data of 640x480 resulting image signals for display on the TV receiver 200. 

resolution on a display aspect ratio of 4:3, the graphics data The image display control apparatus can control a display to 

are horizontally scaled down and converted to the resolu- reproduce motion pictures by utilizing the DVD-ROM as 

tions of 540x480 for displaying a letter box by the first scaler shown in FIG. 1, and can also execute many types of 

106 in FIG. 1. Then the scaled graphics data are blended 25 application programs, such as video games and viewer 

with the video data from the MPEG2 decoder in the software. 

a-blending circuit 108 for displaying them on the television The image display control apparatus includes a central 
200 having a display aspect ratio of 16:9. In this case, the processing unit (CPU) U, a host/program controlled inter- 
differences of the pixel aspect ratios of the graphics data and ruption (PCI) bridge 12, a main memory 13, an operating 
display aspect ratio of the video data, due to scaling down 30 system (OS) stored on mask read-only-memory (ROM) 14, 
of the pixels for the graphics data when they are displayed input-output (I/O) control gate array 15, a USB connector 
on the video monitor having the display aspect ratio of 16:9, 16, status displaying light emitting diodes (LEDs) 17 on a 
can be also cancelled. control panel, user switches 18 on the control panel, an 
FIG. 7 is used to illustrate a method for controlling a infrared communication port 19, an RS232C connector 20, 
display of combined data comprising DVD video data and 35 a system BIOS stored in a flash ROM 21, a socket 22 for a 
graphics data on a television having a display aspect ratio of flash memory card, a D/A(Digital-to- Analog) converter 23, 
4:3. In the case of displaying graphics data 71A having an AC3 decoder 24, and an audio mixer 25. In addition, as 
resolutions of 848x480 for a display aspect ratio of 16:9, explained in FIG. 1, the image display control apparatus 
center area data for the 540 pixels in a horizontal direction includes a DVD-ROM drive 101, a MPEG2 decoder 102, a 
are read out from the VRAM by eliminating edge area data 40 video memory (VRAM) 103 and an NTSC/PAL encoder 
in the VRAM. The center area data are scaled up in a 109. 

horizontal direction and converted to the data 71B having The CPU U controls system operations of the image 

the resolutions of 720x480 so as to be the same as the DVD display control apparatus and executes an operating system 

video data. By this up scaling, the pixel aspect ratio becomes and application programs that are loaded in the main 

the same as that of the video data. An actual image data 70B, 45 memory 13. Further, the CPU U executes various driver 

for the display aspect ratio of 4:3, for example, that have programs. 

resolutions of 540x480, are read out as a pan-scanning data The host/PCI bridge 12 is a large-scale Integrated (LSI) 

70A from the MPEG 2 decoder. The pan-scanned video data circuit for converting bi-directional transactions between a 

and the scaled up graphics data arc combined in the processor bus 1 and a PCI bus 2. As shown in FIG. 2, the 

a-blending circuit 108 and the combined data arc displayed 50 host/PCI bridge 12 includes a processor bus interface 121 

on a television monitor 200 having a display aspect ratio of coupled to the processor bus 1 and a PCI bus interface 123 

4:3. Since the graphics data are enlarged in a longitudinal coupled to the PCI bus 2. Further, the host/PCI bridge 12 

direction, when they are displayed on the television having includes a memory controller 122 for controlling access to 

the display aspect ratio of 4:3, the differences between the the main memory 13 or the mask ROM 14 in accordance 

pixel aspect ratios of graphics data and the video data can be 55 with memory read/write transactions provided from the CPU 

cancelled. 11 or another PCI bus master. 

In the case of displaying graphics data 72A having The I/O control gate array (GA) 15 is a single LSI for 

resolutions of 640x480 for display aspect ratio of 4:3, the controlling various I/O devices. In FIG. 8, the I/O control 

graphics data are scaled up in a horizontal direction and GA 15 includes a PCI/ISA bridge 152 for coupling between 

converted to the data 72B having the resolutions of 720x480 60 the internal PCI bus 2a and an internal ISA bus 2b. 

so as to be the same as the DVD video data. By this up A USB interface 153 for controlling peripheral devices 

scaling, the pixel aspect ratio becomes the same as that of the coupled to the USB interface IS3, such as an external 

video data. As in the previous case, an actual image data 70B keyboard, a bus master IDE controller 154 for controlling 

for the monitor display aspect ratio of 4:3, for example, that the DVD-ROM driver, a display controller 155 for scaling 

have resolutions of 540x480, are read out as a pan-scanning 65 and filtering operations of the graphics data and also for 

data 70A from the MPEG2 decoder. The pan-scanned video blending the graphics data and the motion picture data, and 

data 70B and the scaled up graphics data 72B are combined a MPEG2 interface 156 for controlling interfaces between 
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the internal PCI bus 2a and the MPEG2 decoder are coupled displaying target line. The memory controller reads out these 

to the internal PCI bus 2a. current and next displaying graphics data alternately in a 

Included in the display controller 155 is a memory predetermined unit size for the graphics data in a time 

controller 201 for controlling access to the video memory sharing fashion. For example, if the graphics data stored in 

(VRAM) 103, a bit block transfer circuit (Bit Bit) 202 for 5 the VRAM 103 are 16 bits/pixel and the data bus for the 

executing various logical operations between bit maps of a VRAM 103 are 32 bits, the graphics data for the current and 

transferer and a transferee and scaling up or down the bit next displaying target lines are read out alternately at a unit 

maps, a graphics/video mixer 203, and a PCM audio con- size of 2 pixels. The multiplexer 303 selects the current 

trailer 204. The graphics/video mixer 203 includes the RGB bufTer 301 and the next buffer 302 alternately and outputs the 

palette 104a, the color space converter 1046, the filter 105, to graphics data in the selected buffer at the unit pixel in a serial 

the first scaler 106 for the graphics data, the second scaler fashion. The unit pixel data are supplied to the RGB palette 

107 for the video data and the a-blcnding circuit 108. The 104a and the color key circuit 304c. 
PCM audio controller 204 is a PCM audio source for The R BG/YCrCb color space convert er 104 6 is com - 

converting audio data, such as sound effects for the graphics prised of~tnree color tables ot 1UK, 1W ancTlOB for 

data, to PCM audio data. 15 respectively corresponding to the red (R), green (G) and blue 

In the system in FIG. 8, the DVD-ROM driver 101 reads (B) colors, as depicted in FIG. 10. As shown in FIG. 10, each 

an application program file for controlling a reproduction of of the color tables 10R, 10G and 10B is comprised of 256 

the DVD images from the DVD media and loads the entries and an address decoder for selecting one entry among 

program in the main memory 13. Under the control of the the 256 entries by decoding the pixel data of 8 bits. Each of 

program, the CPU 11 reads out graphics data that are used, 20 the entries stores the color data of 8 bits, 
for example, as a background scene for motion picture data In the case that the graphics data of 16 bits/pixel are stored 

read out from the DVD media and writes the data into the in the VRAM 103, the data is divided as follows for the 

VRAM 103. Further, the CPU 11 transfers audio data to the respective pixel data for each color: 
PCM audio controller 204. R=5 bits, G«6 bits and B=5 bits. 

The graphics/video mixer 203 reads out graphics data for 25 The R data of 5 bits are used for an upper bit portion of the 

a plurality of displaying lines that includes a displaying 8 bit data provided into the red (R) color table. Similarly, the 

object line from the VRAM 103, alternately, in a time data of 6 bits and the data of 5 bits are respectively used for 

sharing fashion. Then the mixer 203 executes the horizontal an upper bit portion of the 8 bit data of the respective green 

scaling and the vertical filtering for the graphics data in a and blue color tables as indicated in FIG. 10. Accordingly, 

pipeline operation. The PCM audio controller 204 converts 30 the RGB color palette 104a in FIG. 9 operates as a color l) - 

the audio data to the PCM audio data, and provided them to correction table and outputs color data of 24 bits from the 

the digital-to-analog converter (DAC) 23. graphics data of 16 bits/pixel from the VRAM 103. 

The motion picture data stored in the main memory 13 are In the case that the graphics data stored in the VRAM 103 % 

provided to the MPEG2 decoder 102 through the DVD- are index color data of 8 bits/pixel, the index color data are 

ROM drive 101. In the MPEG2 decoder 102, the video data 35 commonly supplied to the respective R, G and B color tables 

and the subpicture data included in the MPEG2 program and are converted to the color data of 24 bits, respectively, 

stream are respectively decoded and the digital video data The color key circuit 304c in FIG. 9 keeps a corresponding 

under the YCrCb 422 format are generated. The audio data relationship between the color of the pixel data and the a 

included in the. MPEG2 program stream are directly sent value, of .4 bits. The color key circuit 304c provides the a 

from the MPEG2 decoder 102 to the AC3 decoder 24 and the 40 value corresponding to a color for the input pixel data. The 

AC3 decoder 24 decodes the audio data. a value provided by the color key circuit 304c is delayed in 

The decoded digital video data under the YCrCb 422 a delay circuit 307 during the times for the color space 

format are sent to the graphics/video mixer 203 for blending conversion (i.e., the horizontal scaling and the vertical 

with the graphics data. After blending in the a-blending scaling operations). Then the a value is sent to the 

circuit 108, the combined data are converted to the image 45 a-blending circuit 108 through the second FIFO buffer 

signals (referred to as "composite signals" or the "S 3126. 

signals") in the NTSC/PAL encoder 109 for displaying on The RG B/YCrCb color sp ace _c onvertei "1046 convert s the \ 

the television monitor 200. The audio data decoded in the respective R, Gjmd Bj^^ \ 

AC3 decoder 24 arc mixed with the PCM audio data from lfi fc 1^ 4 Sit data corresponding to me YCK> 444 forma t \ 

the PCM audio controller 204 in the audio mixer 25 and 50 ojjlh ^vjdcj2_ilar5 aS . The converted* YCrCb" data arc pro- ! 

supplied to audio input lines of the television monitor 200. vided to the current .scaling H filter 305, or„to^next_ scaling I 

"^JJ PIG. 9 shows a de tailed block diagram of the graphics/ H filter 50*6." the current scaling H filter 305 and the next -— X- 

video mixer 203 and the peripheral circuits shown in FIG. 8. scaling H fi lter 306 executejhe.hpriz^fllal scjdinjLoperatjon [ 

As shown in FIG. 9, the graphics/video mixer 203 includes on the graphics date Jot the displaying target ^line jnd jhe 

— £a current buffer 301, a next buffer 302, a multiplexer 303 , the 55 next" "displaying fine, respectively. 

RGB palette 104a, t he RBG/YCrCb coE)r__space converte r The converted graphics data for the displaying target line 

1046, a color key circuit 304c, a cu rrent scaling H filter 306, in the YCrCb data of the 444 format are supplied to the 

a next scaling H filler 306, a delay circuit 307, a first YCrCb current scaling H filter 305. The converted graphics data for 

444/442 converter 308, a second YCrCb 444/442 converter the next displaying target line in the YCrCb data of the 444 

309, a before line buffer 310, a first FIFO buffer 312a, a 60 format are supplied to the next scaling H filter 306. At first, 

second FIFO buffer 3126, the a-blending circuit 108, and a these scaling H filters execute a horizontal filtering operation 

A x control register 314. for reducing the brightness and the color signal bandwidth of 

The current buffer 301 and the next buffer 302 are used for the 444 formatted YCrCb data in the target line. Next, the 

temporary storage of the graphics data read out from the scaling H filter circuits execute the horizontal scaling opera- 
VRAM 103 by the memory controller 201. The current 65 tion. 

buffer 301 stores the graphics data for a current displaying The horizontal filtering operation is executed in a circuit 

target line and the next buffer 302 stores the data for the next for integrating and summing calculations as shown in FIG. 
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11. FIG. 11 depicts a horizontal filtering circuit constructed 
with three taps. For executing the integration and sum 
calculation for the successive 3 pixels, the horizontal filter- 
ing circuit comprises first and second delaying circuits 511 
and 512 for respectively delaying one pixel and first, second 
and third integrators 513, 514 and 515 and an adder 516. An 
integration value (coefficient) for the respective integrators 
is determined as a function of a control parameter value 
provided from the control register 314 in FIG. 9. 

The horizontal scaling is executed by thinning the pixel 
that is filtered in the horizontal filtering circuit or by suc- 
cessively outputting the same pixel. The execution contents 
of the horizontal scaling is determined by a H-CONT value 
of the scaling parameter provided from the control register 
314. The H-CONT value of the scaling parameter is deter- 
mined as a function of both the display aspect ratio and the 
resolutions for the graphics source. The horizontally filtered 
and scaled pixels for the displaying target line are sent to the 
YCrCb 444/442 converter 308 in FIG. 9 for converting to 
the YCrCb data of 422 format having a 16 bit width. The 
YCrCb data are provided to the vertical filter circuit 311 in 
FIG. 9. 

The next scaling/H filter 306 executes the horizontal 
filtering operatiop for reducin g the brightness and the color 
signal bandwidth~of the converted YCrCb data of the 444 
format for the next target line and then executes the hori- 
zontal scaling operation. The hori zontal filtering op eration is^ 
executed in t he circuit for integrating and summing calgA- 
l afions rj^ j^ n^nPlR filter a s shown in FIG, lj," The 
horizontally filtered and~sclu r ea r pixels for the displaying 
target line are sent to the YCrCb 444/442 converter 309 in 
FIG. 9 for converting to the YCrCb data of 422 format 
having a 16 bit width. The converted YCrCb data are 
provided to the vertical filter circuit 311 in FIG. 9. The Y 
data among the converted YCrCb d ata are stored into a 
before line buffer 310 in FIG. 9 for use as one-before line 
data during a subsequent scanning operation. 

The vertical filter 311 in FIG. 9 executes the vertical 
filtering, operation among the three lines by using a pixel 
position data corresponding to the before line stored in the 
before-line buffer 310. The vertical filtering operation is also 
executed in a circuit for integrating and summing calcula- 
tions by using a FIR filter as shown in FIG. 12. As shown in 
FIG. 12, toe vertical filtering circuit comprises first, second 
and third integrators 611, 612 and 613 feeding an adder 614 
for calculation of integration and summing of successive 
three lines. An integration value (coefficient) for the respec- 
tive integrators is determined by a control parameter value 
provided from the control register 314 in FIG. 9. 

FIG. 13 is used to illustrate the vertical filtering operation. 
In this example, the displaying target line is the line L3. The 
vertical filtering is executed from the front position pixel for 
the respective lines L2, L3 and L4 in a successive order. By 
this vertical filtering, graphics data of one line is filtered for 
displaying in an odd field of the television. In the next 
scanning operation, the line L5 becomes the displaying 
target line and is the next adjacent odd line to line L3, and 
the vertical filtering is executed among the three lines L4, L5 
and L6. 

After the vertical filtering, the obtained pixel data are 
supplied to a synchronizing information insertion circuit 
(Gen656) 312c in FIG. 9. The insertion circuit 312c gener- 
ates the 1TU-R656 type digital data that are the same as the 
decoded DVD video data by inserting synchronization infor- 
mation in the YCrCb data. 

In the a -blending circuit 108 in FIG. 9, the graphics data 
and the DVD video data are combined as a function of the 
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a value of the pixel position corresponding to the displaying 
target line. This blending is performed at a timing in 
accordance with the synchronization signals in the DVD 
video data detected by the timing control circuit 204 in FIG. 
5 9. The control register 314 can be read/written by the CPU 
11 based on a control parameter designating the character- 
istics of the horizontal/vertical filterings and the horizontal 
scaling. 

As explained above, in the circuit of FIG. 9, the horizontal 

10 scaling and the vertical filtering are executed as a pipe-Line 
operation by executing the horizontal scaling of the graphics 
data for a plurality lines including the displaying target Line 
alternately in a time sharing fashion and then by executing 
the vertical filtering against the horizontally scaled data in 

is the same pixel positions in the plurality of lines at a 
predetermined unit. By performing this pipe- line operation, 
the scaling and the vertical filtering can be effectively 
executed without using the off-screen areas of the VRAM 
103. Consequently, it becomes possible to display the graph- 

20 ics data on a television monitor 200 with improved quality 
without using a special and/or large video memory. 

FIG. 14 is used to illustrate a method for setting a control 
parameter (scaling parameter) of the horizontal scaling in 
the control register 314 shown in FIG. 9. The DVD repro- 

25 duction control application program 114A detects whether 
the graphics source read out from the DVD media has the 
resolutions corresponding to the display aspect ratios of 16:9 
or the 4:3 by using the header information embedded in the 
graphics data. The detected result is provided to a display 

30 driver U4B. 

The display driver 114B determines the scaling method 
based on the TV display aspect ratio of 16:9 or 4:3 that is 
selected by a user or is automatically detected and the 
display aspect ratio of the graphics data. Then, the display 

35 controller 114C sets the scaling parameter for performing 
the determined scaling method in the control register 314. 

FIG. 15 is a flowchart illustrating a control method of the 
graphics scaling applied to the graphics data. In FIG. 15, at 

-. first, the display aspect ratio . of the television monitor is 

40 determined (step S101). If the monitor is selected to display 
the images in the display aspect ratio of 16:9 (A branch at 
step S101) and the display aspect ratio of the graphics has 
the same ratio and has the corresponding resolutions of 
848x480 (Y branch at step S102), the horizontal down 

45 scaling of the resolutions are performed from 848x480 to 
720x480 (step S103). If the monitor is selected to display in 
the display aspect ratio of 4:3 (B branch at step S101) and 
the display aspect ratio of the graphics has the same ratio and 
has the corresponding resolutions of 640x480 (Y branch at 

50 step S102), the horizontal up scaling of the resolutions are 
performed from 640x480 to 720x480 (step S106). If the 
display aspect ratio of the graphics is the 16:9 (N branch at 
step S105), two types of horizontal scaling are performed 
depending upon the original resolutions as depicted at the 

55 stepS107. 

FIG. 16 illustrates a block diagram of another embodi- 
ment of the graphics/video mixer 203 and the peripheral 
circuits shown in FIG. 8. In this embodiment, the graphics/ 
video mixer 203 includes a current buffer 301, a next buffer 

60 302, a multiplexer 303, a color palette 304, a current scaling 
H filler305; T oe'xT scaling H filter 306, a delay circuit 307, 
a first YCrCb 444/442 converter 308, a second YCrCb 
444/442 converter 309, a before line buffer 310, a first FIFO 
buffer 312a, a second FIFO buffer 3126, an a -blending 

65 circuit 108, and a control register 314. 

The current buffer 301 and the next buffer 302 are used for 
temporary storage of the graphics data read out from the 
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VRAM 103 by the memory controller 201. The current 
buffer 301 stores the graphics data for a current displaying 
target line and the next buffer 302 stores the data for the next 
displaying target line. The memory controller reads out these 
current and next displaying graphics data alternately in a 5 
predetermined unit size for the graphics data in a time 
sharing fashion. For example, if the graphics data stored in 
the VRAM 103 are 8 bits/pixel and the data bus for the 
VRAM 103 are 32 bits, the graphics data for the current and 
next displaying target lines are read out alternately at a unit 10 
size of 4 pixels. 

The multiplexer 303 alternately selects the current buff er 
301 and the n cjrtj?ujfer 302 and outputs the graphics data i n 
the s elected bu ffer at the unit pixe l in a serjiaj fashiW- The 
urfir pixel data are supplied to the color palette 304. The 15 
color palette 304 is comprised of 256 entries and an address 
decoder for selecting one entry among the 256 entries by 
decoding an input index value of the 8 bit pixel data, as 
shown in FIG. 17. Each of the entries is registered as a set 
of the YCrCb data of 24 bits corresponding to the 444 format 20 
and the a value of 8 bits. 

The output YCrCb data of the 444 format from the color 
palette 304 are sent to the current buffer 301 or the next 
buffer 302. The a value is sent to the delay circuit 307 for 
delaying during the horizontal scaling and the vertical 25 
scaling operations. Then the a value is sent to the 
a-blending circuit 108 through the second FIFO buffer 
3126. The current scaling H filter 305 and the next scaling 
H filter 306 execute the horizontal scaling operation on the 
graphics data for the displaying target line and the next 30 
displaying line, respectively. 

The converted graphics data for the current displaying 
target line in the YCrCb data of the 444 format are supplied 
to the current scaling H filter 305. The converted graphics 
data for the next displaying target line in the YCrCb data of 35 
the 444 format are supplied to the next scaling H filter 306. 
Hie scaling H filters execute a horizontal filtering operation 
for reducing the brightness and the color signal bandwidth of 
the 444 formatted YCrCb data in the target line. Next, the 
scaling H filter circuit executes the horizontal scaling opera- 40 
tion. The horizontal filtering is executed in a circuit for 
integrating and summing calculations by using a FIR filter. 
The horizontally filtered and scaled pixels for the displaying 
target line are sent to the YCrCb 444/442 converter 308 for 
converting to the YCrCb data of a 422 format. The converted 45 
YCrCb data are supplied to the vertical filter circuit 311. The 
Y data among the YCrCb data is stored in the before-line 
buffer 310 for use during the next scanning operation as a 
previous scanning line data. 

The vertical filter 311 executes the vertical filtering opera- 50 
tion among the three lines by using a pixel position data 
corresponding to the before line stored in the before-line 
buffer 310 at every time when the pixel data for the 
displaying target line and for the next line are completed. 
The vertical filtering is also executed by integrating and 55 
summing calculations used by the FIR filter. The obtained 
pixel data from the vertical filtering are sent to the a 
blending circuit 108 through the first FIFO buffer 312a for 
combining with the motion picture data of the YCrCb 422 
format under the a value of the pixel corresponding to the 60 
position of the displaying line. 

Further, the current buffer 301 is directly coupled to the 
first FIFO buffer 312a through a direct path 161 for direcdy 
sending the graphics data to the NTSC/PAL encoder 109 
through the first FIFO buffer 312a and the a-blending circuit 65 
108. This direct path is used when the CPU instructs to write 
a stationary image by using a plurality of colors, more than 
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the data of the index mode, such as the plurality of color data 
of 16 bits/pixel into the video memory 103. 

By changing the graphics data processing route between 
the index color mode case and the plurality of color data 
case, it becomes possible to obtain an appropriate quality 
display, efficiently. The selection of the data processing 
routes is performed by setting a particular value in the PCI 
interface 155a so as to change the inputs to the first FIFO 
buffer 312a. 

As explained above, by using the line buffer 310 for 
storing the graphics data for the previous one line in the 
graphics/video mixer, it becomes possible to perform the 
vertical filtering among the data for the three lines by 
reading out the graphics data for the two lines alternately in 
a time sharing fashion. In this embodiment, the vertical 
filtering is performed among the three lines. However, it is 
possible to perform the vertical filtering among, for 
example, five lines. 

FIG. 18 illustrates software structure for dynamically 
changing the filtering characteristics by using the control 
register 314 in FIG. 16. The application program 18 A selects 
the filtering operation and transmits it to the operating 
system 18B. In the operating system 18B, a filter setting 
function 18B-F receives the instruction from the application 
program 18A. The instruction for changing the filtering 
characteristics is issued from the application program 18A, 
and the instruction is transferred to a display driver 18C 
from the operating system 18B. The display driver 18C is a 
program for controlling a display controller 18D. The dis- 
play driver 18C set a predetermined control parameter in the 
control register 314 in response to the instruction for chang- * 
ing the filtering characteristics. In this way, it is possible to 
dynamically change the filtering characteristics for character 
displaying portions and for image displaying portions. 
Accordingly, the best filtering characteristics for given 
image contents can be automatically obtained. 

FIG. 19 is a block diagram of another embodiment of the 
graphics/video mixer 203. This graphics/video mixer per- 
forms the vertical filtering of. the three lines without the line 
buffer as depicted in FIG. 16. Instead of using the line buffer, 
the graphics/video mixer has three buffers. That is, the 
graphics/video mixer includes a current buffer 401 for 
storing the graphics data for a displaying target line, a next 
buffer 402 for the next line and a before buffer 403 for the 
before line. Further, in order to perform the horizontal 
scaling of the respective lines in parallel, a current scaling H 
filter 406, a next scaling I I filter 407 and a before scaling H 
filter 408 are provided for the respective lines. 

The vertical filter 413 is constructed so as to perform the 
vertical filtering not only for the Y data but also for both of 
Cr and Cb data. As a consequence, it becomes possible to 
display blended images with high quality without blurring 
the colors as compared to the case of the vertical filtering for 
the Y data only. 

As explained above in detail, the system and method 
according to the present invention can display mixed titles of 
graphics and motion pictures on an interlaced video monitor 
with improved quality. Further, system and method accord- 
ing to the present invention can perform the scaling and 
filtering, efficiently, without using an expensive video 
memory having a wide offscreen area. 

The present invention is described in terms of displaying 
mixed data on a monitor that has a longer width in a 
horizontal direction. However, it is also possible to apply the 
teachings of the present invention for displaying mixed data 
on a monitor that has a longer width in a vertical direction 
as will be apparent to those skilled in the art. 
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Obviously, numerous modifications and variations of Ihe 
present invention are possible in light of the above teach- 
ings. It is therefore to be understood that within the scope of 
the appended claims, the invention may be practiced other- 
wise than as specifically described herein. 5 

What is claimed as new and desired to be secured by 
Letters Patent of the United States is: 

1. A system for controlling a display of mixed images of 
graphics data having a pixel aspect ratio and motion picture 1Q 
data on a video monitor having a display aspect ratio 
different from the pixel aspect ratio, comprising: 

means for storing the graphics data and the motion picture 

data wherein the graphics data is pre-scaled up or 

pre -scaled down in at least one of a horizontal or 15 

vertical resolution direction; and 
means for controlling generation of the mixed images of 

the graphics data on the video monitor, comprising: 

means for reading out the graphics data and the motion 20 
picture data from the storing means; 

means for memorizing the pre-scaled graphics data; 

means for converting the graphics data read out from 
the memorizing means to a particular color data 
applicable to the video monitor; 25 

means for filtering the graphics data for reducing 
brightness and color band width of the graphics data; 

means for scaling down or up the resolution of the 
filtered graphics data in accordance with the display 
aspect ratio for the video monitor; and 30 

means for mixing the scaled down or up graphics data 
from the scaling means and the motion picture data 
on a video window of the video monitor. 

2. The system according to claim 1, wherein: 
the scaling means comprises: 

a first means for scaling the graphics data for displaying 
a current target line; and 

a second means for sea ling , the graphics data for 

displaying a next target line. 40 

3. The system according to claim 1, further comprising: 
first and second means for buffering the graphics data 

from the memorizing means for displaying a current 
target line and a next target line, respectively. 

4. The system according to the claim 1, wherein: 45 
the horizontal resolution of the graphics data is deter- 
mined to be a value that is selected among multiples of 

8 or 16 and so as to be a closest value of multiplication 
of a horizontal resolution of the motion picture data and 50 
a pixel aspect ratio of the motion picture data. 

5. The system according to the claim 1, wherein: 

the resolutions (horizontalxvertical) of the motion picture 

data are 848x480; and 
the resolutions (horizontalxvertical) of the graphics data 55 

are 848x480. 

6. The system according to the claim 1, wherein: 

the resolutions (horizontalxvertical) of the motion picture 
data are 720x480; and 

60 

the resolutions (horizontalxvertical) of the graphics data 
are 868x480. 

7. The system according to the claim 1, wherein: 

the resolutions (horizontalxvertical) of the motion picture 
data are 720x480; and 65 

the resolutions (horizontalxvertical) of the graphics data 
are 832x480. 
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8. The system according to the claim 1, further including: 
means for blending the scaled graphics data and the 

motion picture data in order to display the blended 
image on the video monitor. 

9. A system for controlling a display of graphics data 
having a pixel aspect ratio of 1: 1 on a video monitor having 
a display aspect ratio of 16:9, comprising: 

means for storing the graphics data and video data, 
wherein a resolution of the graphics data is pre-scaled 
up in a horizontal resolution direction so as to keep a 
same pixel aspect ratio of 1:1 when the graphics data 
arc displayed on the video monitor having the display 
aspect ratio of 16:9; 

means for reading out a center area portion of the graphics 
data by eliminating side area portions from the graphics 
data; 

means for scaling down the read out graphics data in a 
horizontal resolution direction to coincide to a horizon- 
tal resolution of the video data; and 

means for displaying the scaled graphics data on the video 
monitor. 

10. The system according to the claim 9, wherein: 

the horizontal resolution of the graphics data is deter- 
mined to be a value that is selected among multiples of 
8 or 16 and so as to be a closest value of multiplication 
of a horizontal resolution of the video data and the pixel 
aspect ratio of the video data. 
U. The system according to the claim 9, wherein: 
the resolutions (horizontalxvertical) of the video data are 
720x480; and 

the resolutions (horizontalxvertical) of the graphics data 
are 848x480. 

12. The system according to the claim 9, wherein: 

the resolutions (horizontalxvertical) of the video data are 
720x480; and 

the resolutions (horizontalxvertical) of the graphics data 
are 868x480. 

13. The system according to the claim 9, wherein: 

the resolutions (horizontalxvertical) of the video data are 
720x480; and 

the resolutions (horizontalxvertical) of the graphics data 
are 832x480. 

14. The system according to the claim 9, further includ- 
ing: 

means for outputting the video data having the display 
aspect ratio of 16:9 under a path scanning mode; and 

means for displaying and blending the scaled graphics 
data and the video data supplied by the path scanning 
mode in order to display the blended image on the 
video monitor. 

15. A system for controlling a display of graphics data 
having a aspect ratio of 16:9 or 4:3 included in an image 
source on a video monitor having a display aspect ratio of 
16:9 or 4:9, comprising: 

means for reading out the graphics data from a memory 

storing the image source; 
means for scaling up or down the read out graphics data 

in a horizontal resolution direction as a function of the 

graphics data in the image source and the display aspect 

ratio of the video monitor; and 
means for displaying the scaled graphics data on the video 

monitor; 

wherein the horizontal resolution of the graphics data is 
determined to be a value that is selected among mul- 
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tiples of 8 or 16 and so as to be a closest value of 
multiplication of a horizontal resolution of toe graphics 
data and the pixel aspect ratio of the graphics data. 
16. A method for controlling to display graphics data 

having a pixel aspect ratio of 1: 1 on a video monitor having 5 

a display aspect of 16:9, comprising: 
storing the graphics data and video data, wherein a 
resolution of the graphics data is pre-scaled up in a 
horizontal resolution direction so as to keep a same 
pixel aspect ratio of 1:1 when the graphics data are 
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displayed on the video monitor having the display 
aspect ratio of 16:9; 

reading out a center area portion of the graphics data by 
eliminating side area portions from the graphics data; 

scaling up the read out graphics data in a horizontal 
resolution direction to coincide to a horizontal resolu- 
tion of the video data; and 

displaying the scaled graphics data on the video monitor. 

* * ♦ * * 
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PROCESSOR WITH RECONFIGU RABLE 
ARITHMETIC DATA PATH 

CROSS REFERENCE TO RELATED 
APPLICATIONS 

This application claims priority in Provisional Applica- 
tion Serial No. 60/010317, filed Jan. 22, 1996. 

TECHNICAL FIELD OF THE INVENTION 

Hie present invention pertains in general to dual proces- 
sors and, more particularly, to a digital processor that has a 
plurality of execution units that are reconfigurable and 
which utilizes a multiplier-accumulator that is synchronous. 

BACKGROUND OF THE INVENTION 

Digital single processors have seen increased use in recent 
years. This is due to the fact that the processing technology 
has advanced to an extent that large fast processors can be 
manufactured. The speed of these processors allows a large 
number of computations to be made, such that a very 
complex algorithms can be executed in very short periods of 
time. One use for these digital single processors is in 
real-time applications wherein data is received on an input, 
the algorithm of the transformer function computed and an 
output generated in what is virtually real-time. 

When digital single processors are fabricated, they are 
typically manufactured to provide a specific computational 
algorithm and its associated data path. For example, in 
digital filters, a Finite Impulse Response (FIR) filter is 
typically utilized and realized with a Digital Single Proces- 
sor (DSP). Typically, a set of coefficients is stored in a RAM 
and then a multiplier/accumulator circuit is provided that is 
operable to process the various coefficients and data in a 
multi-tap configuration. However, the disadvantage to this 
type of application is that the DSP is "customized" for each 
particular application. The reason for this is that a particular 
algorithm requires a different sequence of computations. For 
example, in digital filters, there is typically a multiplication 
followed by an accumulation operation. Other algorithms 
may require additional multiplications or additional opera- 
tions and even some shift operations in order to realize the 
entire function. This therefore requires a different data path 
configuration. At present, the reconfigurable DSPs have not 
been a reality and they have not provided the necessary 
versatility to allow them to be configured to cover a wide 
range of applications. 

SUMMARY OF THE INVENTION 

The present invention disclosed and claimed herein com- 
prises a reconfigurable processing unit. The reconfigurable 
unit includes a plurality of execution units, each having at 
least one input and at least one output. The execution units 
operate in parallel with each other, with each having a 
predetermined executable algorithm associated therewith. 
An output selector is provided for selecting one or more of 
the at least one outputs of the plurality of execution units, 
and providing at least one output to an external location and 
at least one feedback path. An input selector is provided for 
receiving at least one external input and the feedback path. 
It is operable to interface to at least one of the at least one 
inputs of each of the execution units, and is further operable 
to selectively connect one or both of the at least one external 
input and the feedback path to select ones of the at least one 
inputs of the execution units. A reconfiguration register is 
provided for storing a reconfiguration instruction. This is 
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utilized by a configuration controller for configuring the 
output selector and the input selector in accordance with the 
reconfiguration instruction to define a data path configura- 
tion through the execution units in a given instruction cycle. 

5 I another embodiment of the present invention, an input 
device is provided for inputting a new reconfiguration 
instruction into the reconfiguration register for a subsequent 
instruction cycle. The configuration controller is operable to 
reconfigure the data path of data through the configured 

10 execution units for the subsequent instruction cycle. An 
instruction memory is provided for storing a plurality of 
reconfiguration instructions, and a sequencer is provided for 
outputting the stored reconfiguration instructions to the 
reconfiguration register in subsequent instruction cycles in 

15 accordance with a predetermined execution sequence. 

In yet another aspect of the present invention, at least one 
of the execution units has multiple configurable data paths 
therethrough with the execution algorithm of the one execu- 
tion unit being reconfigurable in accordance with the con- 

20 tents of the instruction register to select between one of said 
multiple data paths therein. This allows the operation of each 
of said execution units to be programmable in accordance 
with the contents of the reconfiguration register such that the 
configuration controller will configure both the data path 

25 through and the executable algorithm associated with the 
one execution unit. 

BRIEF DESCRIPTION OF THE DRAWINGS 

jq For a more complete understanding of the present inven- 
tion and the advantages thereof, reference is now made to - 
the following description taken in conjunction with the 
accompanying Drawings in which: 

FIG. 1 illustrates a data flow diagram of a reconfigurable 

35 arithmetic data path processor in accordance with present 
invention; 

FIG. 2 illustrates a top level block diagram of the Mac- 
roSequencer; 

FIG. 3 illustrates a more detailed block diagram of the 
40 MacroSequencer, 

FIG. 4 illustrates a logic diagram of the input register, 

FIG. 5 illustrates a logic diagram of the input selector, 

FIG. 6 illustrates a block diagram of the multiplier- 
4 5 accumulator; 

FIG. 7 illustrates a logic diagram of the adder; 

FIG. 8 illustrates a block diagram of the shifter; 

FIG. 9 illustrates a block diagram of the logic unit; 

FIG. 10 illustrates a block diagram of the one port 
50 memory; 

FIG. U illustrates a block diagram of the three port 
memory; 

FIG. 12 illustrates a diagram of the 3-port index pointers; 
55 FIG. 13 illustrates a logic diagram of the output selector; 
FIG. 14 illustrates a logic diagram of the I/O interface; 
FIG. 15 illustrates a block diagram of the MacroSe- 
quencer data path controller; 

FIG. 16 illustrates a block diagram of the dual PLA; 
60 FIG. 17 illustrates a block diagram of basic multiplier; 
FIG. 18 illustrates an alternate embodiment of the MAC; 
FIG. 19 illustrates an embodiment of the MAC which is 
optimized for polynomial calculations; 
65 FIG. 20 has an additional four numbers generated in the 
multiplier block; 

FIG. 21 illustrates a basic multiplier-accumulator; 
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FIG. 22 illustrates an extended circuit which supports glue logic. The eight I/O pins 36 carj be configured indi- 
optimaJ polynomial calculation steps; vidua lly as input only or output only pins. These can be used 
FIG. 23 illustrates a block diagram of a multiplier block for external interface control. Process initiation and response 
with minimal support circuitry; may be provided externally via input pins 38 directly to the 
FIG. 24 is illustrates a block diagram of a multiplier- 5 MacroSequencers or it may be provided by the program- 
accumulator with Basic Core of Adder, one-port and three- mabIe PLA via the control bus 20. The RADP operates in 
port Memories; and either a configuration operating mode or a normal mode. The 
FIG. 25 illustrates a block diagram of a Multiplier- configuration mode is used for initial izing or reconfiguring 
Accumulator with Multiplicity of Adders, and one-port and lhe RADP me normal mode Ls for executing 
three-port Memories. 10 algorithms. 

Paired MacroSequencer Operational Support 

DETAILED DESCRIPTION OF THE K4 c . , ... VT „ f . A ... 

iMVPwnnM MacroSequencers may be used individually for 16-bit 

operations or in pairs for standard 32-bit addition, 
Referring now to FIG. 1, there is illustrated a block subtraction, and logic operations. When pairing, the Mac- 
diagram of the Reconfigurable Arithmetic Datapath Proces- roSequencers are not interchangeable. MacroSequencers 10 
sor (RADP) of the present invention. The RADP is com- and 12 form one pair, and MacroSequencers 14 and 16 form 
prised of four (4) MacroSequencers, 10, 12, 14 and 16, the other pair. The least significant sixteen bits are processed 
respectively. MacroSequencers 10 and 12 comprised one (1) by MacroSequencers 10 and 12. The two buses 34 are 
pair and MacroSequencers 14 and 16 comprised a second available to the MacroSequencer pairs for direct interchange 
pair. Each of the MacroSequencers has associated therewith 20 of data, 
one of four Buses 18, labeled BusO, Busl, Bus2 and Bus3, Data 

respectively BusO is associated t with MacroSequencer 10, ^ five gi 0 baldaU buses consisting of data buses 18 and 

Busl widi MacroSequencer 12, Bus2 with MacroSequencer . ^ bus24canbe simultaneouslv accessed by all of 

14 and Bus3 w.th MacroSequencer 16. These are global ^ MacroSequencers . Four of ^ buses 18, busO, busl, 

16-b.t buses. There is ako provided a control bus 20, which bus2 ^ bus3 are ^ MacroSequencers 10, 

is a 32-bit bus with 8^ils each associated with the Mac- n 14 and 16, respectively. These four buses receive data 

roSequencer 10-16 , ^h MacroSequencer also has assoo- from eilher ^ MacroS J encer vo pins 22 or an output 

ated therewith an I/O bus 22, each Bus 22 comprises 16 I/O re ^ ster (not shown) ^ to MacroSequencer. The fifth bus, 

lines to allow each of ^ MacroSequencers 10-16 to bus4 , a^ays receives data from B US4INtl5:0] pins, 

interface with 64 I/O pins. Additionally, there is provided a _ J 

16-bit input bus 24 which interfaces with each of the Control Bus 

MacroSequencers 10-16 to allow input of information The Control Bus 20 is used to communicate control, 

thereto. A dual PLA 26 is provided which has associated status, and output enable information between the MacroSe- 

therewith built-in periphery logic to control information to quencer and the PLA 26 or external MacroSequencer pins, 

the bi-directional control bus 20. The PLA 26 interfaces with 35 There are six signals associated with each MacroSequencer.. 

a control bus 20 through a 12-bit bus 28, with an external Two control signals sent to the MacroSequencer are 

20-bit control bus 30 interfacing with the control bus 20 and described here inbe low with reference to a MacroSequencer 

also with PLA 20 through an 8-bit control bus 32. Datapath Controller and are used to: 

Each of the MacroSequencers 10-16 is a 16-bit a fixed- ^ Initiate one of two available LIW sequences, 

point processor that can be an individually initiated either by Continue execution of the LIW sequence, or 

utilizing tte dual PLA26 or directly from the control bus 20. Acknowledge the MacroSequencer status flags by reset- 

The bus 18 allows data to be shared between the MacroSe- . tfae ^ m6 await ^ bits 

quencers 10-16 according to various design needs. By Status Signals 

providing the buses 18, a 16-bit data path is provided, thus 45 Two status signals, Await and Send, are sent from the 

mcreasing data throughput between MacroSequencers MacroSequencer which are described in more detail with 

AdcuUonally, each pair of MacroSequencers 10 and 12 or 14 ct lQ me MacroS cquencer Datapath Controller herein- 

and 16 are interconnected to each other by two (2) private Mqw ^ ^1^. 

16-bit buses 34, 16-bits in each direction. These private ' . 

buses 34 allow each pair of MacroScqucDccrs to be paired the Pro 8' ani Counlcr 15 *"V«**mV 

together for additional data sharing. lhc MacroSequencer is in the send state 

Each MacroSequencer is designed with a Long Ins true- it has executed a specific LIW; 

tion Word (LIW) architecture enabling multiple operations the Program Counter is continuing to sequence; 

per clock cycle. Independent operation fields in the UW lbe MacroSequencer is in the await state and it has 

control the Macr ©Sequencer's data memories, 16-bit adder, 55 executed a specific LIW; and 

multiplier-accumulator, logic unit, shifter, and I/O registers „ « . ™ 

* . . . , . ... . the Program Counter ls not continuing to sequence, and it 

so they may be used simultaneously with branch control. . 0 ... - , . . f ° ~* • 

n • 1 * j t_*i <i r is awaiting further commands before resuming. 

The pipe-lined architecture allows up to seven operations of « . , „ . , 

. . ■ K Output Enable 

the execution units during each cycle. Two output enable signals for each MacroSequencer are 

The UW architecture optimizes performance allowing «, described wilh refererlce to an Output Selection operation 

algorithms to be implemented with a small number of long describcd hereinbelow and allow for output enable to be: 

instruction words. Each Macro -Sequencer may be config- „ _ . . . 

ured to operate independenUy, or can be paired for some fr ° m the Dual 26 oepla outputs or from 

32-bit arithmetic operations. MacroSequeocer(n) output enable MSnOE pins; 

Built-in Glue Logic 6$ alwa y s 0Ut P ut; 

The Dual PLA 26 may be used for initiating stream Always input (the power up condition); or 

processes, output enable signal generation, and interface Optionally inverted. 
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Input Gocks 

Five input clocks are provided to allow the RAOP to 
process multiple data streams at different transmission 
speeds. There is one clock for each Macro-Sequencer, and a 
separate clock for the PLA 26. Each MacroSequencer can 
operate on separate data paths at different rates. The clock 
signals can be connected, for synchronization between the 
four MacroSequencers 10-16 and the Dual PLA 26. 

MacroSequencer Description 

Referring now to FIG. 2, there is illustrated a overall 
block diagram of each of MacroSequencers 10-14. The 
MacroSequencer generally is comprised of two (2) func- 
tional blocks, an arithmetic datapath block 40 and a datapath 
controller block 42. The arithmetic datapath block 40 
includes a three (3) port memory 43 and one port memory 
44, in addition to various execution blocks contained therein 
(not shown). The execution blocks are defined as the arith- 
metic datapath, represented by block 46. The three port 
memory 43 and a one port memory 44 are accessed by the 
arithmetic datapath 46. The datapath controller 42 includes 
an instruction memory 48. The three port memory 43, the 
one port memory 44 and the instruction memory 48 are all 
loaded during an Active Configuration Mode. The arithmetic 
datapath 40 receives input from the data-in bus 24 and 
provides an interface through the interface buses 18 and also 
through the dedicated pair of interfaced buses 34. Control 
signals are received on 6-bits of the control bus 20 through 
control signal bus 50 with status signals provided by 2-bits 
of the control bus 20 through status signal lines 52. 

The control signals may initiate one of two programmed 
LIW sequences in instruction memory 48 in normal oper- 
ating mode. Once a sequence begins, it will run, or loop 
indefinitely until stopped by the control signals. An await 
state programmed into the LIW sequence will stop the 
Program Counter from continuing to increment. The LIW 
sequences are a combination of data steering, data 
processing, and branching operations. Each MacroSe- 
quencer may execute a combination of branch, memory 
access, logic, shift, add, subtract, multiply-accumulate, and 
input/output operations on each clock cycle. The instruction 
memory can be reloaded dynamically at any time by tran- 
sitioning to Active Configuration Mode which will also 
initialize all registers in the entire device. 

Referring now to FIG. 3 is illustrated a block diagram of 
the MacroSequencer datapath for MacroSequencers 10-16. 
The databus 18 and da tabus 24 are input to input register 60, 
which also receives a constant as a value. There are two (2) 
registers in the input registers 60, an input register A and 
input register B. The output of the input register A is output 
on the line 62 and the output of the input register B is output 
on the line 64. The contents of input registers A and B on 
lines 62 and 64 are input to an input selector block 66. As 
will be described hereinbelow, the input selector is operable 
to provide a central portion of a pipeline structure where data 
is processed through six stages. 

There are nine (9) basic elements in the MacroSequencer 
Arithmetic Datapath. Six (6) of these are data processing 
elements and six (6) are data steering functions, of which the 
input selector 66 is one of the data steering functions. The 
data processing elements include a multiplier-accumulator 
(MAC) 68, an adder 70, a logic unit 72 and a shifter 74. The 
three port memory 43 and the one port memory 44 also 
comprise the data processing elements. The data steering 
functions, in addition to the input selector 66, also include 
the input register block 60 and an output register block 76. 

The input register block 60, as noted above, can capture 
any two (2) inputs thereto. Input selector 66 is operable to, 
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in addition to, receive the two line 62 and 64, as noted above, 
and also receive two (2) outputs on two (2) lines 78 from the 
output of the three port memory 43 and one (1) output line 
80 from the one port memory 44. It also receives on a line 

5 82 an output from the output register block 76 which is from 
a register A. The output of the register B, also output from 
the output register block 76 is output on a line 84 to the input 
selector. In addition, a value of "(T is input to the input 
selector block 66. The input selector block 66 is operable to 

10 select any three operands for data processing elements. 
These are provided on three buses, a bus 86, a bus 88, and 
a bus 90. A bus 86 is input to the MAC 68, the adder 70 and 
the logic unit 72, with bus 88 input to the MAC 68, adder 70 
and logic unit 72. The Bus 90 is input only to a shifter 74. 

is The MAC 68 also receives as an input the output of the 
register B on a line 92 and the output of the one port memory 
44. The output of MAC 68 comprises another input of the 
adder 70, the out put of the adder 70 input to the output 
selector block 76. The logic unit 72 has an output that is 

20 connected to the output selector 76, as well as a shifter 74 
having an output to the output selector block 76. The output 
selector block 76 also receives as an input the output from 
register B in the input register block 60. The output of 
register B is connected to the output one of the MacroSe- 

25 quencer pier bus 34, whereas the output of register B is 
output to the input of an interface block 96 which is 
connected to one of the four data buses 18 and the I/O bus 
22. The I/O bus 22 also comprises an input to the output 
selector 76. Therefore, the output selector/register block 76 

30 is operable to select which two of the data processing 
elements are stored, as will be described in more detail 
hereinbelow. 

Each of the four (4) parallel data processing units, the 
MAC 68, Adder 70, logic unit 72 and shifter 74, runs in the 

35 parallel with the others allowing the execution of multiple 
operations per cycle. Each of the data processing functions 
in the MacroSequencer datapath will be discussed herein- 
below in detail. However, they are controlled by the opera- 
tion fields in the MacroSequencers LIW register. It is noted 

40 that, as described herein, the terms "external" and "internal" 
do not refer to signals external and internal to the RADP; 
rather, they refer only to signals external and internal to an 
individual MacroSequencer. 
The 16-bit input registers in register block 60 comprise 

45 InRcgA and InRegB. There are six external inputs and one 
internal input available to the Input Registers. The input 
registers are comprised of an 8-to-l multiplexer 100 with the 
output thereof connected to a register 102, the output of 
register 102 comprising the InRegA output Also, an 8-to-l 

50 multiplexer 104 is provided having the output thereof con- 
nected to a register 106, which provides the output InRegB. 
Seven of the inputs of both multiplexers 100 and 104 
connected to six inputs, one input being the 16-bit input of 
bus 24, one being a 16-bit constant input bus 108, four being 

55 the 16-bit data buses 18 and one being the pair bus 34, which 
is also a 16-bit bus. The constant is a value that varies from 
"0" to "65535", which is generated from the LIW register 
bits. The eighth input of the multiplexor 100 is connected to 
the output of register 102, whereas the 8 input of register 106 

60 is connected to the output of register 106. 

The Constant introduces 16-bit constants into any calcu- 
lation. The constant of the MacroSequencer shares internal 
signals with the MacroSequencer Controller as well as the 
MAC 68, the Shifter 74, and the Logic Unit 72. Since the 

65 Constant field of the LIW is shared, care must be taken to 
insure that overlap of these signals does not occur. The 
RADP Assembler detects and reports any overlap problems. 
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lopul Selector 

Referring now to FIG. 5, there is illustrated a block 
diagram of the input selector block 66. The input selector 
block 66 is comprised of a four-to-one multiplexer 110, a 
six-to-one multiplexer 112 and a two-to-one multiplexer 
114. The multiplexer 112 is connected to one input of an 
Exclusive OR gate 116. The output of multiplexer 110 is 
connected to a bus 118 to provide the In Bus A signals, the 
output of Exclusive OR gate 116 is connected to a bus 120 
to provide the InBusB signals and the output of multiplexer 
114 is connected to a bus 122 to provide the InBusC signals. 
Inputs to the Input Selector 66 include: 

InRegA and InRegB from the Input Register 60, 
OutRegA and OutRegB from the Output Register 76, 
meml and mem2 from the Three-Port Memory read ports 

1 and 2 respectively on lines 78, 
memO from the One-Port Memory read port on line 80, 
and 

Constant '0' which is generated in the Input Selector 66. 

Control signals from the MacroSequencer Controller (not 
shown) determine which three of the eight possible inputs 
are used and whether InBusB is inverted or not. The Input 
Selector 66 is automatically controlled by assembly lan- 
guage operations for the MAC 68, Adder 70, Shifter 74, and 
Logic Unit 72 and does not require separate programming. 
The input selections are controlled by the same assembly 
operations used by the MAC 68, Adder 70, Logic Unit 72 
and Shifter 74. 

Multiplier-Accumulator 

Referring now to FIG. 6, there is illustrated a block 
diagram of the MAC 78. The Multiplier-Accumulator 
(MAC) 78 is a three-stage, 16 by 8 multiplier capable of 
producing a full 32-bit product of a 16 by 16 multiply every 
two cycles. The architecture allows the next multiply to 
begin in the first stages before the result is output from the 
last stage so that once the pipe-line is loaded, a 16 by 8 result 
(24-bit product) is generated every clock cycle. 

The input to the MAC 78 is comprised of an Operand A 
and an Operand B. The Operand A is comprised of the output 
of the One-Port memory 44 on the bus 80 and the InBusA 
86. These are input to a three-to-one multiplexer 126, the 
output thereof input to a register 130, the output of the 
register 130 connected to a 16-bit bus 132. The output of the 
register 130 is also input back as a third input of the 
multiplexer 126. The Operand B is comprised of the Out- 
RegB bus 84 and the InBusB bus 88. These buses are input 
to a three-to-one muldplexer 134, the output thereof con- 
nected to the register 136. They are also input to a 2-input 
multiplexer 138, the output thereof input to a register 140, 
the output of register 140 input as a third input to the 
multiplexer 130. The output of registers 130 and 136 are 
input to a 16x8-bit multiplier 142 which is operable to 
multiply the two Operands on the inputs to provide a 24-bit 
output on a bus 144. This is input to a register 146, the output 
thereof input to a 48-bit accumulator 148. The output of the 
accumulator 148 is stored in a register 150, the output 55 
thereof fed back to the input of the accumulator 148 and also 
to the input of a four-to-two multiplexer 152, the output of 
the register 150 connected to all four inputs of multiplexer 
152. The multiplexer 152 then provides two outputs for 
input to the Adder 70 on buses 154 and 156. The operation 
of the MAC 68 will be described in more detail here in below. 
Either or both operands may be signed or unsigned. The 
multiplier input multiplexers 126, 134 and 138 serve two 
purposes: 

1) They align the high or low bytes from Operand B for 
the multiplier which allows 16 by 8 or 16 by 16 
multiply operations; and 
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2) They allow each operand to be selected from three 
different sources: 

Operand A is selected from the One-Port Memory 44, 
InBusA 86, or Operand A from the previous cycle. 

Operand B is selected from the high byte of OutRegB 84, 
InBusB 88, or the least significant byte of the previous 
Operand B. 

The Multiplier Stage 142 produces a 24-bit product from 
the registered 16-bit Operand A and either the most signifi- 
cant byte (8-bits) or the least significant byte of Operand B. 
The Accumulator Stage 148 aligns and accumulates the 
product. Controls in the accumulator allow the product to be 
multiplied by: 1 when <wcight> is low, or 28 when 
<weight> is high. The result is then: added to the result in the 
accumulator 148 when <enable> is acc, placed in the accu- 
mulator replacing any previous value when <enable> is clr, 
or held in the accumulator in lieu of mult3 operation. 

Cycles per Multiply 

The number of cycles required for Multiplies and MACs 
are shown in Tables 1 and 2. 

TABLE 1 
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Cycles Between New Multiplies 




Multiply 


Accuracy 


Cycles 


16 by 8 


16 bits 


1 


24 bite 


2 


16 by 16 


16 bits 


2 


16 by 816 by 832 


3 




bits 




TABLE 2 


Cycles Between New Multiply - Accumulates of n 




Products 




Multiply 


Accuracy 


Cycles 


16 by 8 


16 bits 


n 


32 bits 


D+ 1 




48 bits 


n + 2 


16 by 16 


16 bus 


2n 


32 bits 


2n + 1 




48 bits 


2n + 2 



The MAC internal format is converted to standard integer 
format by the Adder 70. For this reason, all multiply and 
multiply-accumulate outputs must go through the Adder 70. 

If a 16- by 8-bit MAC 68 is desired, new operands are 
loaded every cycle. The Multiplier 142 results in a 24-bit 
product which is then accumulated in the third stage to a 
4-bit result. This allows at least 2 24 multiply-accumulate 
operations before overflow. If only the upper 16-bits of a 
24-bit result are required, the lower eight bits may be 
discarded. If more than one 16-bit word is extracted, the 
accumulated result must be extracted in a specific order. 
First the lower 16-bit word is moved to the Adder 70, 
followed in order by the middle 16 bits and then the upper 
16 bits. This allows at least 2 1S of these 16- by 16-bit 
multiply-accumulate operations before overflow will occur. 

Adder 

Referring now to FIG. 7, there is illustrated a block 
diagram of the Adder 70. The Adder 70 produces a 16-bit 
result of a 16- by 16-bit addition, subtraction, or 16-bit data 
conversion to two's complement every cycle. The Adder 70 
is also used for equality, less-than and greater-than compari- 
sons. The Adder 70 is comprised of two Adder pipes, an 
Adder pipe 160 and Adder pipe 162. There are provided two 
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multiplexers 164 and 166 on the input, with multiplexer 164 Arithmetic Shift Right by n bits. This is the same as 

receiving the multiplier output signal on bus 154 and the logical shift light with the exception that the bits shifted 

multiplexer 166 receiving the multiplier output on bus 156. in are filled with Bit[15], the sign bit. This is equivalent 

Additionally, multiplexer 164 receives the signal on the to dividing the number by 2". 

InBusA 86 with multiplexer 166 receiving as an input the 5 Rotate Shift Left by n bits. The bits shifted out from the 

signals on InBusB 88. The output of multiplexers 164 and highest ordered bit are shifted into the lowest ordered 

166 are input to the Adder pipe 160, the output thereof being bU. 

input to a register 168. The output of register 168 is input to Normalized Shift Right by 1 bit. All bits are shifted one 

the Adder pipe tc > 162, which also receives an external carry lowef i(J or(fcr The Iowest bit ^ lost and tbe highesl bit 

N-bit, a signal indicating whether the operation * a 32-bit or 1Q fa ^ fe me ^ of ^ MdtT 

16-bit operauon and as^eaVunsigned bit. The Adaerpipe * > wbeQ two 16-bit words 

to 162 provides a 4-bit output to a register 170 which .... * n u-. u 

combines the Adder status flags for equality, overflow, sign , ™ * P^uce a 17-bit result. 

andcarryandalsoalo-bitou^utselectoronabusm.The Arithmetic and Rotate ; shifts may shift zero to 

architecture allows the next adder operation to begin in the fifteen blts 35 determined by tbe Shift Length control signal, 

first stage before tbe result is output from the last stage. 15 Lo&c Umt 

The input multiplexers 164 and 166 select one of two Referring now to FIG. 9, there is illustrated a block 

sources of data for operation by the Adder 70. The operands diagram of the Logic Unit 72. The Logic Unit 72 is able to 

are selected from either InBusA 86 and InBusB 88, or from perform a bit-by-bit logical function of two 16-bit vectors 

the Multiplier 68. Select InBusA 86 and InBusB 88 are for a 16-bit result All bit positions will have the same 

selected for simple addition or subtraction and setting the 20 function applied. All sixteen logical functions of 2 bits are 

Adder Status flags. The multiplier 68 outputs, MultOutA 154 supported. The Logic Function controls determine the func- 

and MultOutB 156, are selected for conversion. The first ^on performed. Tbe Logic Unit 72 is described in U.S. Pat. 

adder stage 160 receives the operands and begins the opera- No - 5,394,030, which is incorporated herein by reference, 

uon. The second adder stage 162 completes the operation One-Port Memory 

and specifies the output registers in the Output Selector 25 Referring now to FIG. 10, there is illustrated a block 

where the result will be stored. The two adder stages 160 and diagram of the One-Port Memory 44. The One-Port Memory 

162 may be controlled separately for addition and subtree- 44 is comprised of a random access memory (RAM) which 

tion operations. is a 32x16 RAM. The RAM 44 receives on the input thereof 

The Adders 70 from a pair of MacroSequencers may be the data from the OutRegA bus 82. The output of the RAM 

used together to produce 32 bit sums or differences. There is 30 44 is input to a multiplexer 180, the output thereof input to 

no increase in the pipe-line latency for these 32 bit opera- * register 182, the output of the register 182 connected to the 

tions. The Adder 70 may be placed in the sign or unsigned bus 80. Also, the bus 80 is input back to the other input of 

mode the muluplexer 180. A 5-bit address for the RAM 178 is 

Adder Status Bits— The Equal, Sign, Overflow, and Carry received on a 5-bit address bus 184. The One-Port Memory 

flags are set two cycles after an addition operation (addl or 3 * 44 supports single-cycle read and single-cycle write 

subl) occurs and remain in effect for one clock cycle: operations, but not both at tbe same time. There are 32 

The Equal flag is set two cycles later when the two addressable 16-bit memory locations in the One-Port 

operands are equal during an addition operation; Memor V 44 ^ re P te / 182 " a f V™* register provided 

The Overflow flag is set when the result of an addition or to store and mamtain the ; result of a read operation until a 

subtraction remits in a 16-bit out-of-range value; 40 ^. executcd * Read ^ ^ 0001101 

„. . . - , - . j • * whether reading or writing memory is requested. No opera- 

When the adder 70 is configured for unsigned integer ^ fa rfo ^ d ^ ^ tbe Read and Write Controls are 

arithmetic, Overflow-Carry. Range-0 to 65535; Only one operation, read or write, can occur per 

When the adder is configured for signed integer cy C ic. m dex registers provides me read and write aAlress to 

arithmetic, Overflow-Carry XOR Sign. Range— A$ the 0 ne-Port Memory. The index register may be 

32768 to +32767; incremented, decremented, or held with each operation. 

The Sign flag is set when the result of an addition or tne mdex operation and the read or write operation are 

subtraction is a negative value; controlled by the MacroSequencer UW. 

The Carry flag indicates whether a carry value exists. Three-Port Memory 

The Adder 70 may be used to convert the data in the 50 Referring now to FIG. 11, there is illustrated a block 

Accumulator 148 of the Multiplier 142 to standard integer diagram of a Three-Port Memory 43. The Three-Port 

formats when inputs are selected from the output of the Memory 43 is comprised of a 16x16 RAM 186, which 

MAC 68. Since the Accumulator 148 is 48 bits, the multi- receives as an input the OutRegB contents as an input on the 

plier's accumulated result must be converted in a specific bus 84 and provides two outputs, one output providing an 

order: lower-middle for 32-bit conversion, and lower- 55 input to a multiplexer 188 and one output providing an input 

middle-upper for 48-bit conversion. Once the conversion to a multiplexer 190. The output of multiplexer 188 is input 

process is started, it must continue every cycle until com- to a register 192 and the output of the multiplexer 190 is 

pleted. Signed number conversion uses bits 30:15. i npu t to a register 194. The output of register 192 provides 

Shifter the meml output on the line 78 and the output of register 194 

Shift Mode signals control which Shifter functions are ^ provides the mem2 output on buses 78, buses 78 each 

performed: comprising the 16-bit bus. Additionally, the output of reg- 

Logical Shift Left by n bits (shift low order bits to high ister 192 is fed back to the other input of multiplexer 188 and 

order bits). The data shifted out of the Shifter is lost, the output of register 194 is fed back to the input of the 

and a logical '0* is used to fill the bits shifted in. multiplexer 190. There are two read operations that are 

Logical Shift Right by n bits (shift high order bits to low 65 provided by the RAM 186 and they are provided by two read 

order bits). The data shifted out of the Shifter is lost, addresses, a Readl address on a 4-bit bus 196 and a 4-bit 

and a logical '0' is used to fill the bits shifted in. read address on a bus 198, labeled Read2. The write address 
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is provided on a 4-bit bus 200. The Three-Port Memory 43 
supports two read and one write operation on each clock 
cycle. The two read ports may be used independently; 
however, data may not be written to the same address as 
either read in the same clock cycle. Four index registers are 5 
associated with the Three-Port Memory. Two separate reg- 
isters are provided for write indexing: Write Offset and Write 
Index. These two registers may be loaded or reset simulta- 
neously or independently. Write Offset provides a mecha- 
nism to offset read index registers from the Write Index by 10 
a fixed distance. Increment and Decrement apply to both 
write registers so that the offset is maintained. The two Read 
Index registers may be independently reset or aligned to the 
Write Onset. 

Smart Indexing 15 
Referring now to FIG. 12, there is illustrated a block 
diagram of the Three- Port Memory Index Pointers. Smart 
Indexing operates multiple memory addresses to be 
accessed. This is particularly useful when the data is sym- 
metrical. Symmetrical coefficients are accessed by providing 20 
the Write Offset from the center of the data and aligning both 
Read Indices to the Write Offset. The Read Indices may be 
separated by a dummy read. Additional simultaneous reads 
with one index incrementing and the other decrementing 
allows for addition or subtraction of data that uses the same 25 
or inverted coefficients. Each index has separate controls to 
control its direction. Each index may increment or 
decrement, and/or change its direction. The change in each 
index register's address takes place after a read or write 
operation on the associated port. Smart Indexing is ideal for 30 
Filter, and DCT applications where pieces of data are taken 
from equal distance away from the center of symmetrical 
data. The Smart Indexing method used in the Data Memory 
allows symmetrical data to be multiplied in half the number 
of cycles that would have normally been required. Data from 35 
both sides can be added together and then multiplied with 
the common coefficient For example, a 6-tap filter which 
would normally take 6 multiplies and 7 cycles, can be 
implemented wiPTa si^e^^a^ro^ g^nce r and on ly 
requires 3 cycles to romjfle^ 8-point 40 

DCT which normally requires 64"multiplies and 65 cycles 
can be implemented with a single Macro-Sequencer and 
only requires 32 clock cycles to complete the calculation. 
Output Selector 

Referring now to FIG. 13, there is illustrated a block 45 
diagram of the output selector 76. The output selector 76 is 
comprised of two multiplexers, a 4-input multiplexer 202 
and a 6-input multiplexer 204. Both multiplexers 202 and 
204 receive the outputs from the Adder 70, Logic Unit 72 
and Shifter 74 on the respective 16-bit buses. The output of 50 
multiplexer 202 is input to a register 206, the output thereof 
providing the 16-bit signal for the OutRegA output on bus 
82. This bus 82 is fed back to the remaining input of the 
multiplexer 202 and also back to the input selector 66. The 
multiplexer 204 also receives as an input InRegB contents 55 
on bus 64 and the MacroSequencer share the data on the bus 
34. The output of the multiplexer 204 is input to a register 
208, the output thereof comprising the OutRegB contents on 
the bus 84, which is also input back to an input of the 
multiplexer 204 and to the input selector 66. The Output 60 
Selector 76 controls the state of output registers OutRegA 
206 and OutRegB 208 and controls the state of the MSnl/ 
O[15:0] bus pins. The Output Selector 76 multiplexes five 
16-bit buses and places the results on the two 16-bit output 
registers 206 and 208 which drive the two on-chip buses 82 65 
and 84 and the MacroSequencer I/O pins 22. The Output 
registers may be held for multiple cycles. 



036 Bl 

12 

I/O Interface 

Referring now to FIG. 14, there is illustrated a block 
diagram of the MacroSequencer I/O interface. The contents 
of the output register 206 on the bus 82 are input to a 2-input 
multiplexer 210, the other input connected to bus 203 to 
provide the MacroSequencer I/O data. The output of mul- 
tiplexer 210 provides the data to the associated one of the 
four buses 18, each being a 16-bit bus. Additionally, the 
16-bit bus 82 is input to a driver 212 which is enabled with 
an output enable signal OE. The output of driver 212 drives 
the I/O bus 22 for an output operation and, when it is 
disabled, this is provided back as an input to the multiplexer 
204. The output enable circuitry for the driver 212 is driven 
by an output enable signal MsnOE and a signal OEPLA 
which is an internal signal from the PLA 26. These two 
signals are input to a 2-input multiplexer 214, which is 
controlled by a configuration bit 5 to input multiplexer 216, 
the other input connected to a "1" value. This multiplexer is 
controlled by a configuration bit 6. The output of multiplexer 
216 drives one input of the 2-input multiplexer 218 directly 
and the other input thereof through an inverter 220. The 
multiplexer 218 is controlled by the configuration bit 7 and 
provides the OE signal to the driver 212. The configuration 
bit 4 determines the state of the multiplexer 210. The I/O 
Interface selection for each MacroSequencer determines: 
Input source for data busn and the output enable configu- 
ration. 

Busn Selection 

The input data on the buses 18, busn, is selected from the 
MSnI/O[l5:0] pins 22 or the OutRegA 206 output of 
MacroSequencer(n) by configuration bit 4. When the 
MacroSequencer(n)'s associated busn is connected to the 
OutRegA 206 signal, the MacroSequencer still has input 
access to the MSnl/O pins 22 via the Output Selector. 

Output Enable Control 

Output Enable to the MSnl/O pins is controlled by 
configuration bit selections. Inputs to the output enable 
control circuitry include the MSnOE pin for 
MacroSequencer(n) and the oepla[n] signal from the PLA 
26. The Output Selector diagram for the output enable 
circuitry represents the equivalent of the output enable 
selection for configuration bits 5, 6, and 7 in the normal 
operating mode. 

MacroSequencer Datapath Controller 

Referring now to FIG. 15, there is illustrated a block 
diagram of the MacroSequencer Datapath Controller 42. 
The MacroSequencer Datapath Controller 42 contains and 
executes one of two sequences of Long Instruction Words 
(LTWs) that may be configured into the instruction memory 
48. The Datapath Controller 42 generates UW bits which 
control the MacroSequencer Arithmetic Datapath. It also 
generates the values for the One-Port and Three-Port index 
registers. The Datapath Controller 42 operation for each 
MacroSequencer is determined by the contents of its LTW 
register and the two control signals. 

The Datapath Controller 42 has associated therewith a 
sequence controller 220 which is operable to control the 
overall sequence of the instructions for that particular Mac- 
roSequencer. The sequence controller 220 receives adder 
status bits from the Adder 70 which were stored in the 
register 170 and also control signals from either an internal 
MacroSequencer control bus 222 or from the PLA 26 which 
are stored in a register 224. The contents of the register 224 
or the contents of the bus 222 are selected by a multiplexer 
226 which is controlled by the configuration bit 8. There are 
provided two counters, a counterO 228 and a counterl 230 
which are associated with the sequence controller 220. The 
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instruction memory 48 is controlled by a program counter 
232 which is interfaced with a stack 234. The program 
counter 232 is controlled by the sequence controller 220 as 
well as the stack 234. The instruction memory 48, as noted 
above, is preloaded with the instructions. These instructions 
are output under the control of sequence controller 220 to an 
L1W register 236 to provide the LIW control bits which 
basically configure the entire system. In addition, there are 
provided read addresses, with an index register 238 storing 
the address for the One- Port address on bus 84, an index 
register 240 for storing the read address for the Three-Port 
read address on bus 196, an index register 242 for storing a 
read address for the Three-Port read address bus 198, an 
index register 244 for storing the write address for the 
Three-Port write address bus 200. These are all controlled by 
the sequence controller 220. The status bits are also provided 
for storage in a register 248 to provide status signals. 

The LIW register 236, as noted above, contains the 
currently executing LIW which is received from the instruc- 
tion memory 48, which is a 32x48 reprogrammable memory. 
The program counter 232 is controlled by the stack 234 
which is a return stack for "calls", and is operable to hold 
four return addresses. 

The controller 48 accepts control signals from the PLA 
CtrlReg signals or external MSnCTRL pins which initiates 
one of two possible LIW sequences. It outputs Send and 
Await status signals to the PLA 26 and to external 
MSnSEND and MSnAWAIT pins. 

The Datapath Controller 42 is a synchronous pipelined 
structure. A 48-bit instruction is fetched from instruction 
memory 48 at the address generated by the program counter 
232 and registered into the LIW register 236 in one clock 
cycle. The actions occurring during the next clock cycle are 
determined by the contents of the LIW register 236 from the 
previous clock cycle. Meanwhile, the next instruction is 35 
being read from memory and the contents of the LIW 
register 236 are changed for the next clock cycle so that 
instructions are executed every clock cycle. Due to the 
synchronous pipe-lined structure, the Datapath Controller 42 
will always execute the next instruction before branch ^ 
operations are executed. The program counter 232 may be 
initiated by control signals. It increments or branches to the 
address of the LIW to be executed next 

Hie Adder status signals, Stack 234 and the two Counters 
228 and 230 in the Datapath Controller support the program 
counter 232. Their support roles are: 

the Adder status bits report the value of the Equal, 
Overflow, and Sign, for use in branch operations; 

the Stack 234 contains return addresses; and 

countcrO 228 and Countcrl 230 hold down loop-counter 
values for branch operations. 
The five index registers 238-246 hold write, read, and write 
offset address values for the One -Port and Three-Port memo- 
ries. The write offset index register 246 is used for alignment 
of the two read index registers, and it holds the value of an 
offset distance from the Three -Port Memory 63 write index 
for the two read indices. 

Control Signals 

The MSn Direct Control and Status pins illustrated in 
FIG. 2 are the control and status interface signals which 
connect directly between the pins and each MacroSe- 
quencer. The direct control signals are MSnCTRL[l:0] and 
MSnOE. The direct status signals are MSnAWAIT and 
MSnSEND. Alternatively, the MacroSequencers 10-16 may 
use control signals from the Dual PLA 26. The Dual PLA 
also receives the MacroSequencer status signals. Two Con- 
trol signals for each MacroSequencer specify one of four 



control commands. They are selected from either the 
MSnCTRL[l:0] pins or from the two PLA Controln signals. 
The control state of the MacroSequencer on the next clock 
cycle is determined by the state of the above components 
and the value of these C6ntroln[l:0] signals. 
The four control commands include: 
SetSequenceO 
SetSequenceO sets and holds the Program Counter 232 
to '0' and resets the Send and Await state registers to 
4 0* without initializing any other registers in the 
MacroSequencer. Two clock cycles after the SetSe- 
quenceO is received, the Datapath Controller 42 will 
execute the contents of the LIW register 236 (which 
is the contents of the LIW memory at address '0') 
every clock cycle until a Run or Continue control 
command is received. 
SetSequence2 
SetSequence2 sets and holds the Program Counter 232 
to *2 f and resets the Send and Await state registers to 
*0* without initializing any other registers in the 
MacroSequencer. Two clock cycles after the SetSe- 
quenceO is received, the Datapath Controller 2 will 
execute the contents of the LIW register 236 (which 
is the contents of the LIW memory at address '2') 
every clock cycle until a Run or Continue control 
command is received. 

Run 

Run permits normal operation of the Datapath Control- 
ler 42. This control command should be asserted 
every cycle during normal operation except when 
resetting the Send and/or Await flags, or initiating an 
LIW sequence with SetSequenceO or SetSequence2. 
Continue 

Continue resets both the Send and Await status signals 
and permits normal operation. If the Await State was 
asserted, the Program Counter 232 will resume nor- 
mal operation on the next cycle. 
If an await operation is encountered while the Continue 
control command is in effect, the Continue control command 
will apply, and the await operation will not halt the program 
counter 232, nor will the Await status register be set to a ' 1*. 
Therefore, the Continue control command should be 
changed to a Run control command after two clock cycles. 
If a send operation is encountered while the Continue 
45 control command is in effect, the Continue control command 
will apply, and the Send status register will not be set to a 
•1. 

The following table summarizes the four control com- 
mand options for Controln[l:0] which may be from Ctrl- 
PLAn or from MSnCTRL pins: 

TABLE 3 
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Control 






»[1:0] 


Command 


Description 


00 


Run 


Normal Operating Condition 


0 1 


Continue 


Reset Send and Await registers. 


10 


SetSequenceO 


The program counter is set to '0*. 
Resets the Send and Await registers. 
This must be asserted for at [east two 
cycles. 


1 1 


SctSequence2 


The program counter is set to *2\ 
Resets the Send and Await registers. 
This must be asserted for at Least two 
cycles. 



By allowing two sequence starting points, each MacroSe- 
quencer can be programmed to perform two algorithms 
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without reloading the sequences. The two PLA Controlo 
signals are synchronized within the MacroSequencer. The 
two MSnCTRL pin signals are not synchronized within the 
Macro-Sequencer; therefore, consideration for timing 
requirements is necessary. 
Status Signals 

There are two single-bit registered status signals that 
notify the external pins and the PLA 26 when the MacroSe- 
quencer has reached a predetermined point in its sequence of 
operations. They are the Await and Send status signals. Both 
of the Status signals and their registers are reset to '0' in any 
of these conditions: during Power On Reset, active configu- 
ration of any part of the RADP, or during Control States: 
SetSequenceO, SetSequence2, or Continue. 

When an await operation is asserted from the LIW 
register, the MacroSequencer executes the next instruction, 
and repeats execution of that next instruction until a Con- 
tinue or SetSequence control command is received. The 
await operation stops the program counter from continuing 
to change and sets the Await status signal and register to '1*. 
A Continue control command resets the Await status signal 
and register to '0* allowing the program counter 232 to 
resume. When send operation is asserted, the Send status 
signal and register is set to * 1 ' and execution of the sequence 
continues. The program counter 232 is not stopped. A 
Continue control command resets the Send status signal and 
register to '0'. Status signals are resynchronized by the Dual 
PLA 26 with the PLACLK. 

The Adder status bits, Equal, Overflow, and Sign are 
provided for conditional jumps. 

Long Instruction Word Register 

The purpose of the 48-bit LIW Register 236 is to hold the 
contents of the current LIW to be executed. Its bits are 
connected to the elements in the datapath. The LIW register 
236 is loaded with the contents of the instruction pointed to 
by the Program Counter 232 one cycle after the Program 
Counter 232 has been updated. The effect of that instruction 
is calculated on the next clock cycle. Each of the MacroSe- 
quencers 10-16 is composed of elements that are controlled 
by Long Instruction Word (UW) bits. UWs are pro- 
grammed into Macro-Sequencer Instruction memory 48 
during device configuration. The Datapath Controller 
executes the LIWs which control the arithmetic datapath. 
Some of these fields are available in every cycle. Some are 
shared between more than one operational unit. The follow- 
ing operational fields are available on every cycle: 

One-Port Memory access 

Three-Port Memory access 

Input Register multiplexers 

Input Mux A, B, C 

Output multiplexers 

Adder 1 

Adder 2 

These operational fields are available on every cycle except 
when a Constant is required by an in operation: 
Multiplier 

Multiplier-Accumulator 
These operational fields conflict with each other. Only one 
is allowed in each LIW: 

Shifter 

Logic Unit 

Datapath Controller (if parameters are required) 
Program Counter 

The Program Counter 232 is a 5-bit register which 
changes state based upon a number of conditions. The 



program counter may be incremented, loaded directly, or set 
to '0' or ' 2\ The three kinds of LIW operations which affect 
the MacroSequencer Program Counter explicitly arc: 
Branch Operations, 

SetSequenceO and SetSequence2 operations, and 
Await status operations. 
The Program Counter 232 is set to zero '0': 
During power-on Reset, 

During Active configuration of any part of the RADP, 
During the SetSequenceO control command, 
When the Program Counter 232 reaches the value '31', 
and the previous LIW did not contain a branch to 
another address, or 
Upon the execution of a branch operation to address '0*. 
Control Signal Effects: 

The Controln[l:0] signals are used to reset the program 
counter to either '0* or *2* at any time with either SetSe- 
quenceO or SetSequence2 respectively. A Run control com- 
mand begins and maintains execution by the program 
counter according to the UW. A Continue control state 
resumes the program counter operation after an Await state 
and resets the Send and Await registers to '0' on the next 
rising clock signal A Continue control command after a 
Send status state resets the Send register to '0* on the next 
rising clock signal. 
Status Signal Effects: 

The Await status register is set to *V and the Program 
Counter 232 stops on the next clock cycle after an await 
operation is encountered. A Continue control state resets the 
Send and Await registers and permits the Program Counter 
232 to resume. The Send status register is set to *1' on the 
next clock cycle after a send operation. In the Send status, 
the Program Counter continues to function according to the 
LIW. A Continue control state is required to reset the Send 
register. 

Branch Operations 

The LIW register may contain one Branch Operation at a 
time. Conditional Branches should not be performed during 
the SetSequence control commands to insure predictable 
conditions. 

45 TABLE 4 



10 



15 



20 



25 



30 



35 



40 



Branch Operation Assembly Instruction 



Result 

in the Program Counter 



50 



Unconditional branch jump <address> 



Branch on loop 
CounterO or loop 
Counter 1 not equal to 
'0* 



55 



60 



Branch on an Adder 
status condition: 
Equal, Overflow, 
Sign 

Call subroutine 



jumpcounterO 
<address> 
jumpcounterl 
<addreas> 



jumpequal <addrcsa> 

jumpoverftow 

<address> 

jumps ign <address> 
call <addres6> 



65 



Program Counter is set to 
<address>. 

Program Counter is set to 
<address> if the respec- 
tive branch loop counter 
has a ooo -zero value. 
The respective loop 
counter will then be de- 
cremented in the next 
clock cycle. 
Program Counter is set 
<address> if the Adder 
status bits agree with the 
branch condition. 
The current address prus 
T in the Program 
Counter is pushed onto 
the Stack. The contents 
of the Program Counter 
on the next clock cycle 
will be set to the address 
in the UW. 



17 



US 6,247,036 Bl 



18 



TABLE 4-contioued 





Result 


Branch Operation 


Assembly Instruction in the Program Counter 


Return from 


return Toe address from the top 


subroutine operation 


of the Stack is popped 




into the Program 




Counter. 



Instruction Memory 

The Instruction memory 48 consists of thirty-two words 
of 48-bit RAM configured according to the MacroSequencer 
assembly language program. The Instruction memory 48 is 
not initialized during Power On Reset. For reliability, the 
UW RAM must be configured before MacroSequencer 
execution begins. Bit fields in the LIW Registers control 
datapath operations and program flow. 

GounterO and Counterl 

The counters 228 and 230 are 5 -bit loop counters. Both 
loop counters are filled with 'ffs during Power On Reset and 
active configuration of any component in the RAD P. 
CounterO and Counterl may be loaded by the setcounterO 
and setcounterl operations respectively. The jumpcounterO 
and jumpoounterl operations will decrement the respective 
counter on the next clock cycle until the Counter value 
reaches '0*. The SetSequenceO and SetSequence2 control 
signals do not alter or reset the loop counters. Therefore, the 
counters should be initialized with setcounterO and set- 
counterl operations before they are referenced in the pro- 
gram. 

Stack 

The Stack 234 holds return addresses. It contains four 
5-bit registers and a 2-bit stack pointer. After Power On 
Reset or the active configuration of any component in the 
RADP, the stack pointer and all of the 5-bit registers are 
initialized to '0's. A call performs an unconditional jump 
after executing the next instruction, and pushes the return 
address of the second instruction following the call into the 
Stack 234. A return operation pops the return address from 
the Stack 234 and into the Program Counter 232. The call 
and return operations will repeat and corrupt the Stack 234 
if these operations are in the next LIW after an await 
operation because the program counter 232 is held on that 
address, and the MacroSequencer repeats execution of the 
LIW in that address. 

Index Registers 

The LIW Register 236 controls the five index registers 
which are used for data memory address generation. The 
index register 238 holds the One-Port Memory address. The 
other four index registers 240-246 hold Three-Port Memory 
address information. During Power On Reset or the active 
configuration of any component in the RADP, all index 
register bits are reset to '0's. The control states, Run, 
Continue, SetSequenceO or SetSequence2 do not effect or 
reset the index registers. Each clock cycle that a relevant 
memory access is performed, the memory address can be 
loaded, incremented, decremented or held depending upon 
the control bit settings in each index register. 

MacroSequencer Configuration Bits 

In each MacroSequencer there are nine programmable 
configuration bits. They are listed in the table below. The 
three signed/unsigned related bits are set with directives 
when programming the MacroSequencer. The others are set 
by the software design tools when the configuration options 
are selected. 



TABLE 5 



MacroSequencer Configuration Bits 

Functional 
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20 



Bit 


Block 


Function 


If Bit - 0 


If Bit ~ 1 


0 


Multiplier 


Must operand 


A is unsigned. 


A is signed. , 






A sign 






1 


Multiplier 


Must operand 


B is unsigned. 


B is signed. 






B sign 






2 


Adder 


Signed/Un- 


f 1k«'imm< AAA 

unsigned Ana 


Cim»r1 A/1/1 

oigneo aoo 






signed Bit 






3 


Adder 


1771 fi Rit 


1 d hit Dabirath 


32 bit Datapath 








mode 


mode 


4 


Data Bus 


Select 


Busn inputs are 


Busn inputs are 




Connections 


OutRegA or 


from OutRegA of 


from MSnl/O 






MSnl/O pins 


MacroSequcncer(n) 


pins 






for Macro-Se- 










quencer busn 










inputs 






5 


I/O Inter&ce 


Output Enable 


OE from MS/iOE 


OE from PLA 






Select 


pin 




6 


I/O Inter&ce 


Select OE sig- 


OE - OE 


OE - 4 r 






nal or * 1* 






7 


I/O Inter&ce 


OE Polarity 


OE-OE 


OE-OE 






Select 






8 


Datapath 


Control(l:u] 


Control! 1:0] from 


ControlIlrO] 




Controller 


source select 


MSrtCTRL[lK)] 


from PLA0 








pins 


CtrlPLAiflrO] 



'1* - logical one, *0* - logical zero 

The configuration bits are configured with the instruction 
memory 48, where bits 0 through 8 of the 16-bit program 
30 data word are the nine configuration bits listed above. 
Dual PLA Description 

Referring now to FIG. 16, there is illustrated a block 
diagram of the dual PLA 26. There are provided two PLAs, 
a PLAO 260 and a PLA1 261. Each of the PLAs is comprised 

35 of an input selector 264 for receiving seven inputs. Each 
receives the 16-bit BUS41N bus 24 which is a 16-bit bus, the 
send status bits on a bus 266, the await status bits on a bus 
268, the PLA input signal on the bus 38, the PLA I/O signal 
on the bus 40, the output of each of the PLAs 260 and 261. 

40 Each of the input selectors provides an A and a B output on 
16-bit buses to a minimum term generator 268 which 
provides a 64-bit output This is input to a 34x32 AND array 
270 for each of the PLAs 260 and 261, the output thereof 
being a 32-bit output that is input to a fixed OR gate 272. The 
AND array 270 also provides output enable signals, two for 

45 the PLA 260 and two for the PLA 261. For PLA 260, the 
fixed OR output 272 is an 8-bit output that is input to a 
control OR gate 274, whereas the output of the fixed OR gate 
272 and PLA 261 is a 14-bit output that is input to an output 
OR gate 276 and also is input to the control OR gate 274 and 

50 PLA 260. The output of the control OR gate 274 and PLA 
260 is input to an 8-bit control register 278, the output 
thereof providing the PLA control signals, there being four 
2-bit control signals output therefrom. This control register 
278 also provides the output back to the input selectors 264 

55 for both PLAs 260 and 261. The output of the output OR 
gale 276 and the PLA 261 is input to an output register 280, 
the output thereof providing an 8-bit output that is input back 
to the input selectors 264 for both PLAs 260 and 261 and 
also to an I/O buffer 282. The output of the I/O buffer is 

60 connected to the I/O bus 40 that is input to the input selector 
264 and comprising 8- bit output. The I/O buffer 282 also 
receives the output of the output OR 276. The general 
operation of the PLA is described in U.S. Pat. No. 5,357,152, 
issued Oct. 18, 1994 to E. W. Jennings and G. H. Landers, 

65 which is incorporated herein by reference. 

The Dual PLA 26 provides the two in-circuit 
programmable, 32 input by 34 product term PLAs 260 and 
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261. PLAO 260 may serve as a state machine to coordinate 
the Macro-Sequencer array operation with external devices. 
PLA1 261 may be used for random interface logic. The Dual 
PLA 26 may perform peripheral logic or control functions 
based upon the state of BUS4IN, PLAIN and PLAI/O bus 
states and the Control bus20. The Dual PLA control func- 
tions which may be used by any or all of the MacroSe- 
quencers include: 

Registered control outputs, CtrlRcg[7:0], for 
Initiation of LIW sequences; and 
Control response to Send and Await status signals. 
Combinatorial outputs, oepla[3:0], used to generate Out- 
put Enable signals for the MacroSequencers. The oepla 
[3:0] signals are generated from individual product 
terms. 

The PLAO 260 produces eight CtrlReg outputs that can be 
used as MacroSequencer control signals where two signals 
are available for each of the MacroSequencers 10-14 to use 
as Control signals. They are also available as feedbacks to 
both PL\0 260 and PLA1 261. The CtrlReg[7:0] signals are 
useful in multi-chip array processor applications where 
system control signals are transmitted to each RADR PLA1 
261 produces combinatorial or registered I/O outputs for the 
PLAI/O[7:0] pins 40. The fourteen Fixed OR outputs(F01) 
from OR gate 272 from PLA1 261 are also available to the 
Control OR array 274 in the PLAO 260. The PLAI/O signals 
are useful for single chip applications requiring a few 
interface/handshake signals, and they are useful in multi- 
chip array processor applications where system control 
signals are transmitted to each device. 

RADP Configuration 

The RADP is configured by loading the configuration file 
into the device. 

RADP Configurable Memories There are three memories 
in each of the four MacroSequencers and a Dual PLA 
configuration memory. Within each of the 
MacroSequencers, there is an: 

LIW memory with the nine configuration bits, 

One-Port data memory, and 

Three-Port data memory. 

The nine programmable configuration bits within each 
MacroSequencer are configured as additional configuration 
data words in the LIW configuration data packet. The LIW 
memory, configuration bits, and Dual PLA memory may 
only be loaded during Active Configuration Mode. The 
One-Port and Three-Port data memories for each MacroSe- 
quencer may be loaded during Active Configuration and 
accessed during normal operating mode as directed by each 
MacroSequencer's LIW Register. 

RADP Operating Modes 

The configuration is to be loaded into the RAJDP during 
Active Configuration Mode. The RADP may be in one of 
three operating modes depending on the logic states of 
PGM0 and PGM1: 

In the Normal Operation mode, the RADP MacroSe- 
quencers concurrently execute the UWs programmed into 
each LIW memory. 

The RADP is configured during the Active Configuration 
mode which allows each MacroSequencer's instruction 
memory and Data Memories and the Dual PLA to be 
programmed. 

Passive Configuration mode disables the device I/O pins 
from operating normally or being configured which allows 
other RADPs in the same circuit to be configured. 

Four configuration pins, named PGM0, PGM1, PRDY, 
and PACK, are used to control the operating mode and 
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configuration process. BUS4IN[15:0] pins are used to input 
the configuration data words. 
MULTIPLIER-ACCUMULATOR 
The Multiplier- Accumulator (MAC) 68 is described here- 

5 inabove with reference to the FIG. 3 and FIG. 6. In general, 
this is a synchronous multiplier-accumulator circuit and is 
composed of two pipe stages. 

The first pipe stage is composed of a network of a 
multiplicity small bit multipliers, a multiplicity of local 

10 carry propagate adders forming a multiplicity of trees and a 
pipeline register circuit for holding the results of the roots of 
each adder tree. The leaves of these adder trees are from the 
multiple digit output of the small bit multiplier circuits. The 
second pipe stage is composed of a multiplicity of local 

is carry propagate adders of which all but one of which 
comprise a tree taking the synchronized results of the 
multiplicity of adder trees of the first pipe stage and forming 
a single sum of all adder tree results from the first pipe stage. 
An interface circuit operates on this resulting sum and on a 

20 possibly selected component of the accumulator register(s) 
contents of this pipe stage. The interface circuit either, may 
zero the feedback from the accumulator registers) 14 in 
accumulator 148 and pass the resultant sum from the above 
mentioned adder tree in this pipe stage through or it may 

25 align the resultant sum and the (possibly) selected accumu- 
lator result for processing by the last local carry propagate 
adder. The output of this adder is again submitted to a second 
interface circuit which can modify the adders output by 
alignment, or by zeroing the result. The output of this 

30 interface circuit is then stored in one of the (possibly) 
multiplicity of accumulator registers which comprise the. 
pipeline register bank of this pipe stage. Extensions of this 
multiplier-accumulator embodying input pipe registers 
potentially containing portions of the small bit multiplier 

35 circuitry, variations to the tree structure of the local carry 
propagate adder trees in both pipe stages are claimed. 
Implementations of this basic circuit and extensions 
embodying standard integer, fixed point and floating point 
arithmetic, as well as scalar and matrix modular 

40 decomposition, p-adic fixed and p-adic floating point and 
extended scientific precision standard and p-adic floating 
point arithmetic are included. Extensions embedding imple- 
mentations of the multiplier-accumulator including one or 
more carry propagate adders, multiple data memories cir- 

45 cuitry minimally comprising one-port RAM and three-port 
(2 read port and 1 write port) RAM with synchronization 
registers, shift and alignment circuitry plus content addres- 
sable memory(ies) as well as bit level pack and unpack 
circuitry arc also included. Extensions embedding multiple 

50 instances of implementations of any of the above claimed 
circuitry within a single integrated circuit arc also included. 

For the purpose of describing the MAC 68, some defini- 
tions may be useful. 
They will be set forth as follows: 

55 Wire 

A wire is a means of connecting a plurality of commu- 
nicating devices to each other through interface circuits 
which will be identified as transmitting, receiving or 
bi-directional interfaces. A bi-directional interface will con- 

60 sist of a transmitter and receiver interface. Each transmitter 
may be implemented so that it may be disabled from 
transmitting. This allows more than one transmitter may be 
interfaced to a wire. Each receiver may be implemented so 
that it may be disabled from receiving the state of the wire 

65 it is interfaced to. A wire will be assumed to distribute a 
signal from one or more transmitters to the receivers inter- 
faced to that wire in some minimal unit of time. This signal 
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can be called the state of the wire. A signal is a member of These limitations are common in sub-micro CMOS VLSI 

a finite set of symbols which form an alphabet. Often this fabrication technologies. The use of more resilient timing 

alphabet consists of a 2 element set, although use of multi- schemes has been discussed in the Alternative Clocking 

level alphabets with more than 2 symbols have practical Scheme references, ft will be assumed that a pipeline stage 

applications. The most common wire is a thin strip of metal s will contain a pipeline register component governed by 

whose states are two disjoint ranges of voltages, often control signals of either a traditional synchronous or a 

denoted as '0' and '1*. This alphabet has proven extremely scheme such as those mentioned in the Alternative Clocking 

useful throughout the development of digital systems from Scheme References. 

telegraphy to modern digital computers. Other metal strip K-ary Trees, K-ary and Uniform Trees with Feedback 

systems involving more voltages ranges, currents and fre- 10 For the purposes of this document, a directed graph 

quency modulation have also been employed. The key G(V,E) is a pair of objects consisting of a finite, non-empty 

similarity is the finite, well defined alphabet of wire states. set of vertices V«{v[l], . . . , v[n]) and a finite set of edges 

An example of this is multiple valued current-mode encoded E=(c[l], . . . , e[k]) where each edge c is an ordered pair of 

wires in VLSI circuits such as described in "High-Speed vertices belonging to V. Denote the first component of e[j] 

Area-Efficient Multiplier Design Using Multiple- Valued is by e[jll] and the second component by e{jl2]. Vertices will 

Current-Mode Circuits'* by Kawhito, et. al. Wires have also also be known as nodes in what follows. A directed graph is 

been built from optical transmission lines and fluidic trans- connected if each vertex is a component in at least one edge, 

mission systems. The exact embodiment of the wires of a A directed graph G(V,E) possesses a path if there exists a 

specific implementation can be composed of any of these finite sequence of edges (ek[l] 7 ek[2], . . . ,ek[h]) where h>-2 

mechanisms, but is not limited to the above. Note that in 20 is a subset of E such that the first component of ekjj+l] is 

some high speed applications, the state of a wire in its also the second component of ekjj] for j-1, . . . , h-1. A 

minimal unit of time may be a function of location within the directed graph G(V,E) possesses a cycle if there exists a path 

wire. This phenomena is commonly observed in fluidic, (ek[l],ek[2], . . . ,ek[]) where b>»2 such that the second 

microwave and optical networks due to propagation delay component of ek[h] is also the first component of ek[l]. A 

effects. This may be a purposeful component of certain 25 connected directed graph which possesses no cycles is a tree, 

designs and is encompassed by this approach. Note that typically, this would be called a directed tree, but 

Signal Bundle and Signal Bus since directed graphs are the only kind of graphs considered 

A signal bundle and a signal bus are both composed of a here, the name has been simplified to tree. A k-ary tree is a 

plurality of wires. Each wire of a signal bundle is connected tree where k is a positive integer and each vertex(node) of 

to a plurality of communicating devices through interface 30 the tree is either the first component in k edges or is the first 

circuitry which is either a transmitter or a receiver. The component in exactly one edge. A k-ary tree with feedback 

direction of communication within a signal bundle is con- is a directed graph G(V,E) such that there exists an edge ew 

stant with time, the (communication devices which are trans- such that the directed graph G1(VJE1) is a k-ary tree, where 

mining are always transmitting. Those which are receiving El contains all elements of E except ew. Note that G(V,E) 

are always receiving. Similarly, each wire of a signal bus is 35 contains one cycle. A uniform tree is a tree such that the 

also connected to a plurality of communicating devices. The vertices form sets called layers L[l], . . . , L[m] such that the 

communicating devices interfaced to a signal bus are uni- height of the tree is m and the root of the tree belongs to 

formly attached to each wire so that whichever device is L[l], all vertices feeding the this root vertex belong to 

transmitting transmits on all wires and whichever device(s) L[2], . . . , all vertices feed vertices of L[k] belonging to 

are receiving are receiving on all wires. Further, each 40 L[k+1], etc. It is required the vertices in each layer all have 

communicating device may have both transmitters and the same number of edges which target each vertex in that 

receivers, which may be active at different time intervals. layer. The notation (kl, k2, . . . , kn) where kl kn are 

This allows the flow of information to change in direction positive integers will denote the kl edges feeding the vertex 

through an succession of intervals of time, Le. f the source in Lfl], k2 edges feeding each vertex in L[2], kn edges 

and destinations^) for signals may change over a succession 45 feeding each vertex in L[n]. A uniform tree with feedback 

of time intervals. differs from a uniform tree in that one edge forms a circuit 

Pipeline Register and Stage within the graph. 

The circuitry being claimed herein is based upon a p-adic Number Systems 

sequential control structure known as a pipeline stage. A A p-adic number system is based upon a given prime 

pipeline stage will be defined to consist of a pipeline register 50 number p. A p-adic representation of an unsigned integer k 

and possibly a combinatorial logic stage. The normal opera- is a polynomial -k«a M p"+a^. 1 p'*" 1 +. . . +a 2 p+ao, where a„, 

tional state of the pipeline stage will be the contents of the a^ lf . . . , a 2 , ag are integers between 0 and p-1. A fixed 

memory components within the pipeline register. Additional length word implementation of signed p-adic numbers is 

state information may also be available to meet testability also represented as a polynomial with the one difference 

requirements or additional systems requirements outside the 55 being that the most significant p-digit, a„ now ranges 

intent of this patent. Typical implementations of pipeline between (p-l)/2 and (p-l)/2. 

stage circuits are found in synchronous Digital Logic Sys- Two's Complement Number System 

terns. Such systems use a small number of control signals Two's complement Numbers is a signed 2-adic number 

known as clocks to synchronize the state transition events system implemented in a fixed word length or multiples of 

within various pipeline stages. One, two and four phase 60 a fixed word length. This is the most commonly used integer 

clocking schemes have been widely used in such number system in contemporary digital computers, 

approaches. See the references listed in the section entitled Redundant Number Systems and Local Carry Propagation 

Typical Clocking Schemes for a discussion of these Adders 

approaches applied to VLSI Design. These typical A redundant number system is a number system which 
approaches face severe limitations when clocks must 65 has multiple distinct representations for the same number. A 
traverse large distances and/or large varying capaciuve loads common redundant number system employs an entity con- 
across different paths within the network to be controlled. sisting of two components. Each component possesses the 



US 6,2' 

23 

same bit length. The number represented by such an entity 
is a function (often the difference) between the two compo- 
nents. A local carry propagation adder will be defined as any 
embodiment of an addition and/or subtraction function 
which performs its operation within a constant time for any 
operand length implementation. This is typically done by 
propagating the carry signals for any digit position only to 
a small fixed number of digits of higher precision. This 
phenomena is called local carry propagation. A primary 
application of redundant number systems is to provide a 
notation for a local carry propagation form of addition and 
subtraction. Such number systems are widely used in the 
design of computer circuitry to perform multiplication. In 
the discussion that follows, Redundant Binary Adder Cells 
are typically used to build implementations such as those 
which follow. The local carry propagate adder circuits 
discussed herein may also be built with Carry-Save Adder 
schemes. There are other local or Limited carry propagation 
adder circuits which might be used to implement the fol- 
lowing circuitry. However, for the sake of brevity and 
clarity, only redundant adder schemes will be used in the 
descriptions that follow. Many of the references here in below 
with respect to the High Speed Arithmetic Circuitry discuss 
or use redundant number systems. 

Modular Decomposition Number Systems 

Modular Decomposition Number Systems are based upon 
the Chinese Remainder Theorem. This theorem was first 
discovered and documented for integers twenty centuries 
ago in China. Hie Chinese Remainder Theorem states that: 
Let m[l], m[2], . . . , m[n] be positive integers such that m[i] 
and m[j] are relatively prime for I not equal j. If b[l], 
b[2], . . . , b[n] be any integers, then the system of 
congruences x«b[i] (mod m[i] ) for 1-1, . . . , n, has integral 
solution that is uniquely determined modulo m=m[l] ♦ m[2] 
* ... * m[n]. The Chinese Remainder Theorem has been 
extended in the last hundred and fifty years to a more general 
result which is true in any nontrivial algebraic ring. Note that 
square matrices form algebraic rings and that both modular 
decomposition matrix and p-adic number systems can be 
built which have performance and/or accuracy advantages 
over typical fixed or floating point methods for a number of 
crucial operations, including matrix inversion. Modular 
Decomposition Number Systems have found extensive 
application in cryptographic systems. An important class of 
cryptographic systems are based upon performing multipli- 
cations upon very large numbers. These numbers often 
involve 1000 bits. Arithmetic operations have been decom- 
posed into modular multiplications of far smaller numbers. 
These decompositions allow for efficient hardware imple- 
mentations in integrated circuits. The modular multiplica- 
tions of these smaller numbers could well be implemented 
with the multiplier architectures described hereinbelow. 
Such multiplier implementation would have the same class 
of advantages as in traditional numerical implementations. 

Standard Floating Point Notations 

Standard Floating Point Notation is specified in a docu- 
ment published by ANSI. Floating point arithmetic opera- 
tions usually require one of four rounding mode to be 
invoked to complete the generation of the result. The round- 
ing modes are used whenever the exact result of the opera- 
tion requires more precision in the mantissa than the format 
permits. The purpose of rounding modes is to provide an 
algorithmic way to limit the result to a value which can be 
supported by the format in use. The default mode used by 
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compiled programs written in C, PASCAL, BASIC, FOR- 
TRAN and most other computer languages is round to 
nearest Calculation of many range limited algorithms, in 
particular the standard transcendental functions available in 
FORTRAN, C, PASCAL and BASIC require all of the other 
three modes: Round to positive infinity, Round to negative 
infinity and round to zero. Round to nearest looks at the bits 
of the result starting from the Least significant bit supported 

10 and continuing to the least significant bit in the result. The 
other three rounding modes are round to 0, round to negative 
infinity and round to positive infinity, which are well docu- 
mented in IEEE-ANSI specification for standard floating 

!5 point arithmetic. 

Extended Precision Floating Point Notations 
Extended Precision Floating Point Notations are a pro- 
posed notational and semantic extension of Standard Boat- 
ing Point to solve some of its inherent limitations. Extended 
Precision Floating Point requires the use of accumulator 
mantissa fields twice as long as the mantissa format itself. 
This provides for much more accurate multiply-accumulate 
operation sequences. It also minimally requires two accu- 

25 mutators be available, one for the lower bound and one for 
the upper bound for each operation. The use of interval 
arithmetic with double length accumulation leads to signifi- 
cantly more reliable and verifiable scientific arithmetic pro- 
cessing. Long Precision Floating Point Notations involve the 
use of longer formats. For example, this could take the form 
of a mantissa which is 240 bits (including sign) and an 
exponent of 16 bits. Extended Long Precision Floating Point 
Notations would again possess accumulators supporting 

35 mantissas of twice the length of the operands. These exten- 
sions to standard floating point have great utility in calcu- 
lations where great precision is required, such as interplan- 
etary orbital calculations, solving non-linear differential 
equations, performing multiplicative inverse calculations 

40 upon nearly singular matrices, 
p-adic Floating Point Systems 

P-adic arithmetic can be used as the mantissa component 
of a floating point number. Current floating point implemen- 

45 tations use p»2. When p>2, rounding to nearest neighbor has 
the effect of converging to the correct answer, rather than 
often diverging from it in the course of executing a sequence 
of operations. The major limitation of this scheme is that a 
smaller subset of the real numbers than can be represented 

50 compared with the base 2 arithmetic notation. Note that the 
larger p is and the closer it is to a power of two, the more 
numbers can be represented in such a notation for a fixed 
word length. One approach to p-adic floating point arith- 

ss metic would be based upon specific values of p with 
standard word lengths. The next two tables assume the 
following format requirements: 

The mantissa field size must be a multiple of the number 
of bits it takes to store p. 
60 The mantissa field size must be at least as big as the 
standard floating point notation. 

The exponent field will be treated as a signed 2's comple- 
ment integer. 

65 The mantissa sign bit is an explicit bit in the format. 
The following Table 6 summarizes results based upon these 
assumptions for Word Length 32: 
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TABLE 6 
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Mao- 



Mantissa Digits base p 





Exponent 


Field Numerical 


Dynamic Range (in base 


p 


Field Size 


Size 


Expression 


10) 


3 


7 


24 


Mamissa*3 E ** 0<1H " 


12 digits 

3*° to 3-** (10 30 to 10" 31 ) 


7 


7 


24 


Mantissa*?****™ 111 


8 digits 

r° to 7-* 1 (10 53 to lO^ 54 ) 


15 


7 


24 


Mantissa* 15***-*" 


6 digits 

15*° to 15"* 4 (10 74 to 
HT 75 ) 


31 


6 


25 


Mantissa «3j*»p««« 


5 digits 

31 51 to 31-* 2 (10" to 10 47 ) 



10 



15 



Note the from this table: 
The standard single precision Moating point mantissa is 23 bits, with an 
implied 24 bit. 
Its exponent field is 8 bits. 

The standard single precision floating point dynamic range is 2 127 to 2" l2B 
(10 3 * to 10* 39 ). 

The p - 7, 15 and 31 formats all have greater dynamic range and at least 20 
as much mantissa precision as the standard single precision format. 

The following table summarizes results based upon these 
assumptions for Word Length 64: 

25 



TABLE 7 






Man- 








tissa 


Mantissa Digits base p 




Exponent 


Field Numerical 


Dynamic Range (in base 


P 


Held Size 


Size Expression 


10) 


3 


9 


54 Mantissa*3 E " poMt 


27 digits 
















10-" 2 ) 


7 


9 


54 Mantissa *7* TTinnrTTt 


18 digits 








f** to (10 2 " to 








io- JAa ) 


15 


7 


56 Mantissa*l5 E ** ,0 ~ m 


14 digits 








15** to 15" 84 (10 74 to 








lO" 75 ) 


31 


8 


55 Mantissa"31 E *P COOB 


11 digits 








31 127 to 31' 128 (10 lBO to 








io- lw > 



30 



40 



Note from this table: 

The standard double precision floating point mantissa is 53 bits, with an 
implied 54-th bit. Its exponent field is 10 bits. 
The standard doable precision floating point dynamic range is 2 s11 to 
2-*« (10 153 to 10-" 4 ). 

The p - 7 and 31 formats have greater dynamic range and at [east as 
much mantissa precision as the standard double precision format 

One may conclude from the above two tables that p-adic 
floating point formats based upon p-7 and p-31 offer 
advantages in dynamic range with at least as good mantissa 
accuracy for both single and double precision(32 and 64 bit) 
formats. It seems reasonable that p-7 has distinct advantages 
over p-31 in terms of inherent implementation complexity. 
The mantissa component of a floating point number system 
can also be composed of two components, known here as 
MSC and LSC, for Most Significant Component and Least 
Significant Component, respectively. The MSC can be con- 
structed as a binary or 2-adic system and the LSC can be 
constructed from a p-adic system where p>2. Such an 
arrangement would also converge to the correct answer in 
round to nearest neighbor mode and would have the advan- 
tage of making All use of the bits comprising the MSC. If the 
LSC occupies the "guard bits" of the floating point arith- 
metic circuitry, then the visible effect upon the subset of 
floating point numbers which can be represented is the 
consistent convergence of resulting operations. This would 
aid standard Floating Point notation implementation. If p is 
near a power of two, then p-adic number based mantissa 



45 



50 



55 



60 



65 



calculations would be efficiently stored in memory. Particu- 
larly for p=3 and 7, the modular arithmetic multiplier 
architecture could amount to specializing the redundant 
binary adder chain in each adder strip and slightly changing 
the Booth encoding algorithms discussed in the following 
implementation discussions. If the MSC represented all but 
2, 3 or 5 bits of the mantissa, then p«3, 7 or 31 versions of 
p-adic arithmetic could respectively be used with minimal 
impact on bow many numbers could be represented by such 
notations. Note that for this kind of application, p need not 
be restricted to being prime. As long as p was odd, the 
desired rounding convergence would result. It will be gen- 
eral assumed throughout this document that p-3,7,15 and 31 
are the most optimal choices for p-adic floating point 
extensions, which are "mostly" prime. Both the number 
systems discussed in the previous paragraphs will be des- 
ignated as p-adic floating point systems with the second 
version involving the MSC and LSC components being 
designated the mixed p-adic floating point system when 
relevant in what follows. Both of these notations can be 
applied to Extended Precision Floating Point Arithmetic. 

Overview Discussion of the MAC 

The basic operation of a multiplier 142 is to generate from 
two numbers A and B, a resulting number C which repre- 
sents something like standard integer multiplication. The 
accumulation of such results, combined with the multipli- 
cation are the overall function of a multiplier/accumulator. It 
is noted that the accumulation may be either additive, 
subtractive or capable of both. 

This description starts with a basic block diagram of a 
multiplier-accumulator and one basic extension of that 
multiplier/accumulator which provides significant cost and 
performance advantages over other approaches achieving 
similar results. These circuit blocks will be shown advan- 
tageous in both standard fixed and floating point 
applications, as well as long precision floating point, 
extended precision floating point, standard p-adic fixed and 
floating point and modular decomposition multiplier appli- 
cations. 

Optimal performance of any of these multiplier- 
accumulator circuits in a broad class of applications requires 
that the multiplier-accumulator circuit receive a continuous 
stream of data operands. The next layer of the claimed 
devices entail a multiplier-accumulator circuit plus at least 
one adder and a local data storage system composed of two 
or more memories combined in a network. The minimum 
circuitry for these memories consists of two memories, the 
one-port memory 44 and the 3-port memory 43. The cir- 
cuitry described to this point provides for numerous 
practical, efficient fixed point algorithmic engines for pro- 
cessing linear transformations, FFTs, DCFs, and digital 
filters. 

Extension to support various floating point schemes 
requires the ability align one mantissa resulting from an 
arithmetic operation with a second mantissa. This alignment 
operation is best performed by a specialized circuit capable 
of efficient shifting, Shifter 74. Support of the various 
floating point formats also requires efficient logical merging 
of exponent, sign and mantissa components. The shift cir- 
cuitry mentioned in this paragraph (assuming it also sup- 
ports rotate operations) combined with the logical merge 
circuitry provides the necessary circuitry for bit-packing 
capabilities necessary for image compression applications, 
such as Huffman coding schemes used in JPEG and MPEG. 
Once aligned, these two mantissas must be able to be added 
or subtracted from each other. The long and extended 
precision formats basically require at least one adder to be 
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capable of performing multiple word length "chained" 
addition-type operations, so that the carry out results must be 
available efficiently to support this. 

Support for p-adic arithmetic systems requires that the 5 
multiplier-accumulator implementation support p-adic arith- 
metic. Similar requirements must be made of at least one 
adder in an implementation. The p-adic mantissa alignment 
circuitry also makes similar requirements upon the shifter. 
Modular arithmetic applications are typically very long 10 
integer systems. The primary requirement becomes being 
able to perform high speed modular arithmetic where the 
modular decomposition may change during the execution of 
an algorithm. The focus of such requirements is upon the ^ 
multiplier-accumulator and adder circuitry. 

Basic Multiplier Overview of Basic Multiplier 142 and Its 
components 

Referring now to FIG. 17, there is illustrated a block 
diagram of basic multiplier. A very fast way to sum 2** 20 
numbers (where P is assumed to be a positive integer) is 
called a Binary Adder Tree. Adders D1-D7 form a Binary 
Adder Tree summing 8»2 3 numbers, CI to C8 in a small bit 
multiplier 300. The numbers CI to C8 are the partial ^ 
products of operand A and portions of operand B input to 
multiplier 300, which are then sent to the adder tree D1-D7. 
These partial products are generated within the multiplier 
300 by a network of small bit multipliers. The Adder D8 and 
the logic in block Gl align the resulting product from Adder 30 
D7 and the selected contents of the block Hi representing the 
second stage of pipeline registers an alignment The accu- 
mulated results are held in memory circuitry in block HI. 
This provides for the storage of accumulated products, 35 
completing the basic functions required of a multiplier- 
accumulator. 

The circuitry in the stage-one pipeline registers El acts as 
pipeline registers making the basic circuit into a two pipe- 
stage machine. The time it takes for signals to propagate 40 
from entry into multipliers 30 to the pipeline registers of El 
is about the same as the propagation time from entry into 
Adder D7 to the pipeline registers in HI. Thus the pipeline 
cycle time is about half of what it would be without the 45 
registers of El. 

Transform circuitry Jl is provided on the output of HI 
that performs several functions. It selects which collection of 
memory contents are to be sent outside the multiplier/ 
accumulator, it transforms the signal bundle to be sent to a so 
potentially different format, it selects which collection of 
memory contents are to be sent to Adder D8 for accumula- 
tion and it transforms that signal bundle to be sent to Adder 
D8, if necessary, to a potentially different format The 55 
circuitry in Jl permits the reduction of propagation delay in 
the second pipeline stage of this multiplier-accumulator, 
since the final logic circuitry required to generate the results 
can occur in Jl after the pipeline registers of HI and the use 
of non-standard arithmetic notations such as redundant 60 
binary notations in the adder cells of Dl to D9, since the 
notation used internally to the multiplier-accumulator can be 
converted to be used with a standard 2's complement adder 
for final conversion. 65 

An example of the above can be seen in implementing a 
redundant binary notation as follows: 
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TABLE 8 





A Standard Notation 


A Non-standard 


Represented 


as used in Takagi's 


Signed Magnitude 


number 


Research StflO] 


Notation Sn[l:0] 


0 


00 


10 


1 


01 


11 


-1 


10 


01 



This notation turns out to be optimal for certain CMOS 
logic implementations of an 8 by 16-bit multiplier based 
upon FIG. 17. Conversion by a standard two's complement 
adder required conversion from the Non-standard Signed 
Magnitude notation to a Standard Notation. This was done 
by implementing the logic transformation: 

St[l>QOt Sn[l] 
St(0)-Sn{0] 

Optimal implementations of redundant p-adic notations to 
carry propagate p-adic notation conversion may also require 
this. 

With the above noted structure, the following operations 
can be realized: 

Signed and Unsigned 8 by 16 bit multiplication and 

multiply-accumulate 
Signed and Unsigned 16 by 16 bit multiplication and 

multiply-accumulate 
Signed and Unsigned 24 by 16 multiplication and 

multiply-accumulate 
Signed and Unsigned 24 by 24 bit multiplication and 

multiply-accumulate 
Signed and Unsigned 24 by 32 bit multiplication and 

multiply-accumulate 
Signed and Unsigned 32 by 32 bit multiplication and 

multiply-accumulate 
Optimal polynomial calculation step 
Fixed point versions of the above: 
Standard Floating Point Single Precision Mantissa Mul- 
tiplication 

Extended Precision Floating Point Single Precision Man- 
tissa 
Multiplication 

P-Adic Floating Point Single Precision Mantissa Multi- 
plication 

P-Adic Fixed Point Multiplication and Multiplication/ 
accumulation. 

These operations can be used in various applications, 
some of which are as follows: 

1. 8 by 16 multiplication/accumulation is used to convert 
between 24 bit RGB to YUV color encoding. YUV is 
the standard broadcast NTSC color coding format. The 
standard consumer version of this requires 8 bit digital 
components to the RGB and/or YUV implementation. 

2. 16 bit arithmetic is a very common form of arithmetic 
used embedded control computers. 

3. 16 by 24 bit multiplication/accumulation with greater 
than 48 bits accumulation is capable of performing 
1024 point complex FFTs on audio data streams for 
Compact Disk Applications, such as data compression 
algorithms. The reason for this is that the FFT coeffi- 
cients include numbers on the order PI/512, which has 
an approximate magnitude of Yist. Thus a fixed point 
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implementation requires accumulation of 16 by 24 bit signal bundle. Tbe logic circuitry generating signal bundles 

multiplications to preserve the accuracy of tbe input C1-C8 will vary, depending upon the number systems being 

data. used for A and B, the number systems being employed for 

4. 24 by 24 bit multipUcation/acciimulaUon is also com- we 01-04 adders > me size of me si 8 nal bund,es A B 
monly used in audio signal processing requirements. 5 plus «he exact nature of the multiphcauon algpnthm bemg 
Note that by a similar argument to the last paragraph, ™pkmented In the d^on of following cmbodunente, 
~>a i_ t_-* it' i* a / certain specific examples will be developed. These will by 
24 by 32 bit muluplications arenecessary to preserve Q0 raeaQ Tdetail all practical implementations which could 
the accuracy of the data for a 1024 point complex FFT. ^ ^ upoQ ^ bm ^ dcmonslrate ^nun 

5. 32 bit arithmetic is considered by many to be the next applications of high practical value that are most readily 
most common used form of integer arithmetic after 16 10 discussed. 

bit. It should be noted that this arithmetic is required for Referring now to FIG. 18, there is illustrated an alternate 

implementations of the long integer type by C and C++ embodiment of the MAC 68. In this embodiment, a 16 bit by 

computer language execution environments. 16 bit multiplier/accumulator based upon a 4*3 modified 

6. Polynomial calculation step operations, particularly 80010 coding scheme is illustrated, wherein only Cl-6 are 
fixed point versions, are commonly used for low degree 15 needed for mc basic operation. C7-Y would be available for 
polynomial interpolation. These operations are a com- addin & an o2set - ^ Ieads 10 implementations capable of 
mon mechanism for implementing standard transcen- supporting P°lynomial step calculations starting every cycle, 
dental functions, such as sin, cos, tan, log, etc. assuming that me implemenUUon possessed two accumu- 

„ „ . . . . . , lators in the second pipe stage. The polynomial step entails 

7. Standard Floating Pomt Arithmetic is the most widely ^ calculating X *Z+Y, where X and Y are input numbers and 
used dynamic range arithmetic at this time. Z is the state of an accumulator register in HI. Implemen- 

8. Extended Precision Floating Point arithmetic is appli- tation of 4-3 Modified Booth Coding schemes and other 
cable wherever Standard Floating Point is currently similar mechanisms will entail multipliers 300 containing 
employed and resolves some serious problems with the equivalent of an adder similar to those discussed here- 
rounding errors or slow convergence results. The major 25 inbelow. 

drawback to this approach is that it will run more Referring now to FIG. 19, there is illustrated an embodi- 

slowly the comparable Standard Floating Point Arith- ment of the MAC 68 which is optimized for polynomial 

metic. It is important to note that with this approach, calculations. In this case, all eight small bit multiplications 

there is no performance penalty and very limited addi- (CI to C8) are used. In such situations, the Jl component can 

tional circuit complexity involved in supporting this 30 provide Z for the calculation through a multiplexer 302. Gl 

significant increase in quality. performs alignment of the accumulators) being used for - 

9. P-Adic Floating Point and Fixed Point arithmetic are potential input to both multipliers 300 and Adder D7. Adder 
applicable where Standard Floating point or fixed point D9 now requires controls to support alignment of the 
arithmetic are used, respectively. The advantage of product with the target accumulator. This is done by trans- 
these arithmetics is that they wuU tend to converge to 35 mitting through the local carry propagation chain in D9 
the correct answer rather than randomly diverging in signals which act to mask carry propagation to successive 
round to nearest mode and can take about the same digit cells and control transmission of top-most digits) carry 
amount of time and circuitry as standard arithmetic propagation signals to the bottom most cell(s). This makes 
when implemented in this approach. It should be noted the Adder D9 into a loop of adder cells which can be broken 
that in the same number of bits as Standard Floating 40 at one of several places. Jl already had a requirement of 
Point, implementations of p-7 p-adic floating point aligning and potentially operating on tbe stored state of its 
have greater dynamic range and at least the same accumulators) before feedback, this circuit implementation 
mantissa precision, making these numeric formats bet- just adds slightly to that requirement. 

ter than standard floating point. Note that in the circuits represented by FIGS. 18 and 19, 

Referring further to FIG. 17, the operation of the various 45 the presence of at least two accumulators is highly desirable, 

components will be described in more detail. The multipliers such that two polynomial calculations can then be performed 

in a small bit multiplier block 300 perform small bit mul- in approximately the same time as one is performed. This is 

tiplications on A and D and transform signal bundles A and due to the 2 pipe stage latency in the multiplier. 

B into a collection of signal bundles CI to C8 which are then Adders Dl to D4 perform local carry propagation 

sent to the Adder circuits D1-D4. Signal bundles A and B 50 addition, typically based upon some redundant binary nota- 

each represent numbers in some number system, which docs tion or implementation of carry-save adders. They serve to 

not have to be the same for both of them. For instance, A sum the partial products CI to C8 into four numbers. The 

might be in a redundant binary notation, whereas B might be partial products CI to C8 are digit-aligned through how they 

a two's complement number. This would allow A to contain are connected to tbe adders in a fashion discussed in greater 

feedback from an accumulator in the second pipe stage. This 55 detail later. These adders and those subsequently discusse J 

would support an optimal polynomial calculation step opera- herein can be viewed as a column or chain of adder cells, 

lions. Number systems which maybe applicable include, but except where explicitly mentioned. Such circuits will be 

are not limited to, signed and unsigned 2's complement, referred to hereafter as adder chains. It is noted that all 

p-adic, redundant binary arithmetic, or a modular decom- adders described herein can be implemented to support 

position systems based on some variant of the Chinese 60 p-adic and modular arithmetic in a redundant form similar to 

Remainder Theorem. the more typical 2-adic or redundant binary form explicitly 

The signal bundles CI to C8 are partial products based used hereafter, 

upon the value of a small subset of one of the operands (A Adders D5 and D6 perform local carry propagation addi- 

or B) and all of the other operand. In the discussion that tion upon the results of Adders Dl, D2 and D3, D4 respec- 

follows, it will be assumed that the A signal bundle is used 65 lively. 

in its entirety for generating each C signal bundle and a The circuitry in El acts as pipeline registers making the 

subset of the B signal bundle is used in generating each C basic circuit into a two pipe-stage machine. The memory 
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circuits of El hold the results of adders D5 and D6. It may 
also bold Y io FIG. 19, which may either be seat from a bus 
directly to El, or may have been transformed by the mul- 
tiplier block 300 to a different notation than its form upon 
input In certain embodiments, the last layers of the logic in 5 
Adders D5 and D6 may be "moved" to be part of the output 
circuitry of the pipeline registers of El. This would be done 
to balance the combinatorial propagation delay between the 
first and second pipeline stages. The time it takes for signals 
to propagate from entry into multiplier block 300 to the 10 
pipeline registers of El is then about the same as the 
propagation time from output of the El registers into Adder 
D7 to the pipeline registers in HI. Thus the pipeline cycle 
time is about half of what it would be without the registers 
of El. In certain applications, this register block El may be 15 
read and written by external circuitry with additional mecha- 
nisms. This could include, but is not limited to, signal bus 
interfaces and scan path related circuitry. 

Adders D7 and D8 receive the contents of the memory 
circuits of El, which contain the results of the Adders D5 20 
and D6 from the previous clock cycle. D7 and D8 perform 
local carry propagation addition on these signal bundles. The 
result of Adder D7 is the completed multiplication of A and 
B. This is typically expressed in some redundant binary 
notation. 25 

Gl aligns the product which has been generated as the 
result of Adder D7 to the accumulator Hi's selected con- 
tents. GI selects for each digit of the selected contents of HI 
either a digit of the result from Adder D7 or a '0' in the digit 
notation to be added in the Adder D8. Gl also can support 30 
negating the product resulting from D8 for use in accumu- 
lation with the contents of a register of HI. Assume that the 
contents of HI are organized as P digits and that the 
multiplication result of Adder D7 is Q digits and the length 
of A is R digits and B is S digits. It is reasonable to assume 35 
that in most numeric systems, Q>-R+S and P>-Q. If 
P>— Q+S, then Gl can be used to align the result of Adder D7 
to digits S to Q+Max(R T S) 7 thus allowing for double (or 
multiple) precision multiplications to be performed within 
this unit efficiently. This provides a significant advantage, 40 
allowing multiple precision integer arithmetic operations to 
be performed with a circuit possessing far fewer logic 
components than would be typically required for the entire 
operation to be performed. Combined with the two pipe 
stage architecture, this makes double precision multiplica- 45 
tions take place about as fast as a single pipestage version 
with somewhat more half the number of logic gates. 

In FIGS. 17 and 18, Adder D9 is composed of local carry 
propagation adder cells as in Adders Dl to D7. It adds the 
aligned results of the Adder D7 to the selected contents of 50 
HI to provide the signal bundle to HI for storage as the new 
contents of one memory component in HI. In FIG. 19, 
Adder D9 is composed of a loop of local carry propagate 
adder cells which may be broken at one of several places to 
perform the alignment of the product with the accumulator. 55 

HI contains one or more clocked memory components 
(known hereafter as registers) which act as temporary stor- 
age accumulators for accumulating multiplications coming 
from Adder D9. Given the exact nature of multiplier block 
300, Gl and the number of digits in each of Hi's registers, 60 
and the performance requirements for a particular imple- 
mentation of this circuit, the optimal number of registers 
contained in HI will vary. In certain applications, this 
register block HI may be read and written by external 
circuitry using additional mechanisms. This could include, 65 
but is not limited to signal bus interfaces and scan path 
related circuitry. 



036 Bl 

32 

If Hi has more than one register, Jl selects which of these 
registers will be output to external circuitry. Jl also selects 
which of these registers is to be used for feedback to Adder 
D9 in FIGS. 1 and 2 and Adder D8 in FIG. 19. Jl selects 
which portion of Hi's selected registers) will be transmitted 
in cases where the register is longer than either the receiving 
buss or carry propagate adder it will enter. If the internal 
notation of an implementation of this circuit is not a standard 
notation, then the signal bundle to be transmitted to external 
circuitry is transformed by Jl into a standard notation which 
can then be converted by a carry propagate adder into the 
relevant standard arithmetic notation. In embodiments 
where extended precision arithmetic is a requirement, Jl can 
be used to "move the more significant bits down" and insert 
0's in the vacated most significant bits. In embodiments 
requiring the accumulator contents be subtracted from the 
generated product from Adder D7, Jl would also perform 
negating the selected registers contents for delivery to the 
input of Adder D9 in FIGS. 1 and 2 and Adder D8 in FIG. 
19. 

Embodiments of this architecture support high-speed 
multiple-precision operations, which is not possible in typi- 
cal integer or fixed-point arithmetic circuits. Hie perfor- 
mance of multiple- precision operations lowers throughput, 
but preserves the exactness of result. These are not possible 
at anything approaching the throughput and size of circuitry 
based upon this block diagram. Embodiments of this archi- 
tecture can support standard single-precision floating point 
mantissa multiplications with significantly less logic cir- 
cuitry than previous approaches. Embodiments of this archi- 
tecture appear to be the only known circuits to support small 
p-adic mantissa multiplications. The authors believe that this 
is the first disclosure of such a floating point representation. 
Embodiments of this architecture provide a primary mecha- 
nism for implementing Extended precision Floating Point 
Arithmetic in a minimum of logic circuitry. Embodiments of 
this architecture also provide implementations of efficient 
high speed modular arithmetic calculators. 

Basic Multiplier Embodied as 8 by N multiplier- 
accumulator based upon FIG. 17 

In this discussion, AO represents the least significant digit 
of the number A. The digits of A are represented in descend- 
ing order of significance as AfAeAdAc, AbAaA9A8, 
A7A6A5A4, A3A2A1A0. B is represented as an 8 digit 
number represented by B7B6B5B4, B3B2B1B0. 

Multipliers 300 arc controlled by a signal bundle. One 
control signal, to be referred to as Ul. A sign determines 
whether the A operand is treated as a signed or an unsigned 
integer. A second control signal, referred to as Ul.Bsign 
determines whether the B operand is treated as a signed or 
unsigned integer. Four distinct one digit by one digit mul- 
tiplications are performed in the generation of the CI to C8 
digit components for the adders Dl to D4. Let Ax represent 
a digit of A and By represent a digit of B. The operation 
AxuBy is an always unsigned multiplication of digit Ax with 
digit By. The operation AxsBy is an unsigned multiplication 
of Ax and By when the Ul.Asign indicates the A operand is 
unsigned. The operation AxsBy is a signed multiplication 
when the Ul.Asign indicates that the A operand is a signed 
integer. The operation BysAx is an unsigned multiplication 
of Ax and By when the Ul.Bsign indicates the B operand is 
unsigned. The operation BysAx is a signed multiplication 
when the Ul.Bsign indicates that the B operand is a signed 
integer. The operation AxSBy is an unsigned multiplication 
when both Ul.Asign and Ul.Bsign indicate unsigned inte- 
ger operands. The operation AxSBy is a related to the 
multiplication of the most significant bits of A and B. This 
operation is determined by controls which specify whether 
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the individual operands arc signed or unsigned. The follow- Column definitions for the following performance evalu- 
ing Table 9 illustrates C1-C8 for digits 0 to 23: ation tables: 



TABLE 9 



CI 


C2 


C3 


C4 


C5 


C6 


C7 


C8 


Digit k 


o 


0 


0 


0 


0 


0 


0 


0 


23 


o 


0 


o 


0 


o 


0 


0 


AfSB7 


22 


o 


0 


o 


o 


o 


o 


A&B6 


AeuB7 


21 


o 


0 


o 


o 


o 


A&B5 


AeuB6 


AduB7 


20 


o 


0 


o 


o 


AfaB4 


AeuBS 


AduB6 


AcuB7 


19 


o 


0 


o 


AfsBJ 


AeuB4 


AduBS 


AcuB6 


AbuB7 


18 


o 


0 


A£bB2 


AeuB3 


AduB4 


AcuBS 


AbuB6 


AauB7 


17 


0 


AfsBl 


AeuB2 


AduB3 


AcuB4 


AbuB5 


AauB6 


A9uB7 


16 


AfsBO 


AeuBl 


AduB2 


Acufi3 


AbuB4 


AauBS 


A9uB6 


A8uB7 


15 


AeuBO 


AduBl 


AcuB2 


AbuB3 


AauB4 


A9uB5 


A8uB6 


A7uB7 


14 


AduBO 


AcuBl 


AbuB2 


AauB3 


A9uB4 


ASuBS 


A7uB6 


A6uB7 


13 


AcuBO 


AbuBl 


AauB2 


A9uB3 


A8uB4 


A7uB5 


A6uB6 


A5uB7 


12 


AbuBO 


AanBl 


A9uB2 


A8uB3 


A7uB4 


A6uB5 


A5uB6 


A4uB7 


11 


AauBO 


A9uBl 


A8uB2 


A7uB3 


A6uB4 


A5uB5 


A4uB6 


A3uB7 


10 


A9uB0 


ASuBl 


A7uB2 


A6uB3 


A5uB4 


A4uB5 


A3uB6 


A2uB7 


9 


AduBO 


A7uBl 


A6uB2 


A5uB3 


A4uB4 


A3uB5 


A2uB6 


AluB7 


8 


A7uB0 


A6uBl 


A5uB2 


A4uB3 


A3uB4 


A2uB5 


AJuB6 


AOuB7 


7 


A6uBO 


A5uBl 


A4uB2 


A3uB3 


A2uB4 


AluBS 


A0uB6 


0 


6 


ASuBO 


A4uBl 


A3uB2 


A2uB3 


AluB4 


AOuBS 


0 


0 


5 


A4uB0 


A3uBl 


A2uB2 


AluB3 


A0uB4 


0 


0 


0 


4 


A3uBO 


A2uBl 


AluB2 


AOuBJ 


0 


0 


0 


0 


3 


A2uB0 


AJuBl 


A0uB2 


0 


0 


0 


0 


0 


2 


AluBO 


AOuBl 


0 


0 


0 


0 


0 


0 


1 


AOuBO 


0 


0 


0 


0 


0 


0 


0 


0 



Discussion of Adders Dl to D7 

Adders Dl to D4 contain 18 digit cells for addition. 
Adders D5 and D6 contain 21 digits cells for addition. Adder 
D7 contains 25 digit cells for addition. Each of these adders 
contains one more cell than the number of digits for which 
they have no inputs. Implementations of D8, Gl, HI and Jl 
to achieve various arithmetic requirements. 

Performance Evaluation of 1-bit small-bit multipliers 
Table 10 illustrates Capability Versus Size Comparison 
with N-16 based upon FIG. 17. 



"Operation" describes a form of integer multiplication 
generating the exact result which may be accumulated. 

"Acc Bits" refers to the equivalent number of bits in 
standard integer arithmetic that the accumulator would 
be implemented to hold. 

"Alignment Slots" refers to the implementation of Gl all 
diagrams and Adders D7, D8 and D9 in FIG. 3. Specific 
Details regarding each implementation will be dis- 
cussed in the note regarding each circuit referenced in 
the "Remarks'* column. 



TABLE 10 













Cyc 


Cyc 


Typical 


Typical 








Align- 




El + 


Start 


to 


Adder 


Register 






Acc 


ment 


Adder 


HI 


to 


start 


Cell 


Bit 




Operation 


Bits 


Slots 


Cells 


Bits 


End 


next 


Count 


Count 


Remarks 


Mul 8*16 


40 


2 


172 


120 


2 


1 


128 


80 


Allows 2 W 
accumulations 
Note 1 


Mul 










3 


2 


256 


80 


Allows 2 B 


16*16 


















accumulations 


Mul8*l6 


48 


3 


180 


128 


2 


1 


128 


96 


Allows 2 M 
accumulations 
Note 2 


Mul 












2 


256 


96 


Allows 2 ia 


16*16 


















accumulations 


Mul 












3 


384 


96 


Allows 2 8 


16*24 


















accumulations 


Mul 8M6 


56 


4 


188 


136 




1 


128 


112 


Allows 2 M 
accumulations 
Note 3 


Mul 












2 


256 


112 


Allows I 24 


16M6 


















accumulations 


Mul 












3 


384 


112 


Allows 2 i0 


24*16 


















accumulations 


Mul 












4 


576 


112 


Allows 2* 


32*16 


















accumulations 
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"Adder Cells" refers to the number of adder cells needed 
to implement the adders involved in implementing the 
noted circuit based upon this patent's relevant block 
diagram. Unless otherwise noted, the adder cells will be 
two input cells, i.e. they perform the sum of two 5 
numbers. In cases where not only 2-input but also 
3-input adder cells are involved, the notation used will 
be "a^" where a represents the number of 2-input adder 
cells and b represents the number of 3-input adder cells. 

"El+Hl Bits'* will refer to the number of bits of memory 10 
storage required to build the circuit assuming a radix-2 
redundant binary arithmetic notation. 

"Cyc Start to End" refers to the number of clock cycles 
from start of the operation until all activity is com- 
pleted. 15 

"Cyc to start next" refers to the number of clock cycles 
from the start of the operation until the next operation 
may be started. 

"Typical Adder Cell Count" represents a circuit directly ^ 
implementing the operation with an accumulating final 
adder chain with no middle pipe register or alignment 
circuitry. Larger multiplications will require bigger 
adder trees. The columnar figure will be based upon 
using a similar small bit multiplier cell as described in ^ 
the appropriate discussion of multipliers 300. 

"Typical Register Bit Count** refers to the number of bits 
of memory that a typical design would require to hold 
a radix-2 redundant binary representation of the accu- 
mulator alone in a typical application. 3Q 

"Remarks** contains a statement regarding the minimum 
number operations the circuit could perform before 
there was a possibility of overflow. 

The Remarks entry may also contain a reference to a 
"Note", which will describe the implementation details 35 
of the multiplier-accumulator circuit being examined. 
The row of the table the Note resides in describes the 
basic multiplication operation performed, the size of 
the accumulator, number of alignment slots. The Note 
will fill in details should as the weighting factor 40 
between the alignment slot entries and any other per- 
tinent details, comparisons and any other specific com- 
ments. 

Notes: 

Alignment in this new circuit is the same as multiplying 45 
the product by 1 and 2 8 -256. It is functionally equiva- 
lent to a 16 by 16 bit multiplier with follow-on local 
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carry propagate adder for accumulation. The equivalent 
circuit would require 256 adder cells and 80 bits of 
accumulator memory compared to 172 adder cells and 
120 bits of memory. Its clock cycle time is approxi- 
mately half that of the standard equivalent device and 
would have the same throughput as the standard imple- 
mentation. 

Alignment in this new circuit is the same as multiplying 
the product by 1, 2 8 =256 and 2 16 »256 2 . It is function- 
ally equivalent to a 16 by 24 bit multiplier with 
follow-on local cany propagate adder for accumula- 
tion. The equivalent circuit would require 384 adder 
cells and 96 bits of accumulator memory compared to 
180 adder cells and 128 bits of memory. The new 
circuit would require about half the logic of the stan- 
dard functional equivalent circuit. Its clock cycle time 
is approximately half that of the standard equivalent 
device. Throughput of the standard implementation 
would be once every one of its clock cycles (or two of 
this new circuit), whereas performance of 16 by 24 bit 
multiply could be performed every three cycles in the 
new circuit. However, the new circuit would be twice 
as fast at multiplying 8 by 16 bits and would have 
identical performance for 16 by 16 bit multiplications. 

Alignment in this new circuit is the same as multiplying 
the product by 1, 2 8 -256, 2 16 -256 2 and 2 24 -256 3 . It is 
functionally equivalent to a 16 by 32 bit multiplier with 
follow-on local carry propagate adder for accumula- 
tion. The equivalent circuit would require 576 adder 
cells and 112 bits of accumulator memory compared to 
188 adder cells and 136 bits of memory. The new 
circuit would require about a third the logic of the 
standard functional equivalent circuit. Its clock cycle 
time is approximately half that of the standard equiva- 
lent device. Throughput for a 16 by 32 bit multiplica- 
tion with the standard implementation would be once 
every one of its clock cycles (or two of this new 
circuit), whereas performance of 16 by 24 bit multiply 
could be performed every four cycles in the new circuit. 
However, the new circuit would be twice as fast at 
multiplying 8 by 16 bits, would have identical perfor- 
mance for 16 by 16 bit multiplications, as well as being 
able to perform a 16 by 24 bit multiplication every 3 
clock cycles. 

Table 11 illustrates Capability \fersus Size Comparison 
with N=24 based upon FIG. 17: 



TABLE 11 













Cyc 


Cyc 


Typcial 


Typical 






Align- 




El + 


Start 


to 


Adder 


Rcgistr 




Acc 


ment 


Adder 


HI 


to 


start 


Cell 


Bit 


Operation 


Bits 


Slots 


Cells 


Bits 


End 


next 


Count 


Count Remarks 


Mul 8*24 


48 


3 


236 


160 


3 


1 


192 


80 Allows 2 l ° 



accumulations 
Note 1 

Mul 16*24 4 2 384 96 Allows 2 e 

accumulations 

Mul 24*24 6 3 576 96 Allows 1 

operation 

Mul 8*24 64 4 244 184 3 1 192 128 Allows 2 32 

accumulations 
Note 2 

Mul 16*24 4 2 128 128 Allows 2 24 

accumulations 

Mul 24*24 5 3 576 128 Allows 2 16 



accumulations 
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TABLE 11-continued 
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TVDcial 
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Align- 




El + 


Start 


to 
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Rcgistr 






Acc 




Adder 


HI 


to 


start 


Cell 


Bit 




Operation 


Bits 


Slots 


Cells 


Bits 


End 




Count 


Count 




Mui 32-24 










65 


43 


1098 


128 


Allows 2 8 
accumulations 


Mul 8*24 


64 


64 


244 


312 


3 


1 


192 


256 


Allows 2 37 
accumulations 
Note 3 


Mul 16*24 










4 


2 


128 


256 


Allows 2 24 
accumulations 


Mu! 24*24 










5 


3 


576 


256 


Allows 2 l * 
accumulations 


Mul 32*24 










6 


4 


1098 


256 


Allows 2 6 
accumulations 


Fmul 










5 


3 


576 


256 


Allows 


24*24 


















indefinite 
number of 
accumulations 



Notes product circuitry require Gl to be able to mask digits. 

Alignment in this circuit is the same as multiplying the Integer performance regarding 2*24, 16*24, 24*24 and 
product by 1, 2 8 «256 and 2 i6 -256 2 . It is functionally 32*24 arithmetic is the same as that described in the 
equivalent to a 24 by 24 bit multiplier with follow-on 25 prcvious QOlc This circuit can also perform 40*24 arith- 
Iocal carry propagate adder for accumulation. The 4 . - . , . ... - rrrTv „„*u 

1 / ■ ■» u ei£ jj it j metic every 5 clock cycles, which has utility in FFTs with 

equivalent circuit would require 576 adder cells and 96 J J . 

bits of accumulator memory compared to 236 adder greater than IK complex points. 

cells and 160 bits of memory. The new circuit would Multiplier as a 16 by N multiplier-accumulator (N>«16) 
require about half the logic of the standard functional 30 Using 3-2 Booth Coding 

equivalent circuit. Its clock cycle time is approximately The Modified 3-2 bit Booth Multiplication Coding 

half that of the standard equivalent device. Throughput Scheme in multiplier block 300 

of the standard implementation would be once every ... . . ^ 0 . KT . , „ 

f , , 1 / ^_ r • vT The primary distinction between the 8 by N lmplemen- 

one of its clock cycles (or two of this new circuit), .... ■ ... . 

whereas performance of 24 by 24 bit multiply could be 35 taUon and trm implemenUhon is m me multipher block m 
performed every three cycles in the new circuit. In ihis implementation a version of Booth's Algorithm is 
However, the new circuit would be twice as fast at used to minimize the number of add operations needed. The 
multiplying 8 by 24 bits and would have identical Booth Algorithm is based upon the arithmetic identity 
performance for 16 by 24 bit multiplications. -2"~ 1 +2 W " 2 +. . . +2+1-2"!. The effect of this identity is that 

Alignment in this multiplier-accumulator is the same as 40 multiplication of a number by a string of l's can be 
multiplying the product by 1, 2 8 «256, 2 16 »2S6 2 and performed by one shift operation, an addition and a sub- 
2 24 -256 3 . It is functionally equivalent to a 24 by 32 bit traction. 

multiplier with follow-on local carry propagate adder m folJowing algorithm is based upon examining 3 
formation. IJe^valcnt circuit would require successive ^ a^^g to ve ^ xm an ^ or 

1098 adder cells and 128 bits of accumulator memory 45 , . - . . . . 

compared to 244 adder cells and 184 bits of memory. lbcD P rocessm e over 2 blt positions and repeating 

The multiplier-accumulator would require about a me P rocess - * ^own as the 3-2 bit coding scheme, 
quarter the logic of the standard functional equivalent There is a one bit overlap, the least significant bit of one 
circuit. Its clock cycle time would be less than half that examination is the most significant bit of its predecessor 
of the standard equivalent device. Throughput for a 24 50 examination. 

by 32 bit multiplication with the standard implcmen- Table 12 of 3-2 bit Booth Multiplication Coding Scheme: 
tation would be once every one of its clock cycles (or 

two of this multiplier-accumulator), whereas perfor- TABLE 12 

mance of 32 by 24 bit multiply could be performed 

every four cycles in the multiplier-accumulator. 55 

However, the multiplier-accumulator would be twice as 

fast at multiplying 8 by 24 bits, would have identical 

performance for 16 by 24 bit multiplications, as well as 

being able to perform a 24 by 24 bit multiplication 

every 3 clock cycles. 60 
This is the first of the multiplier-accumulators capable of 
performing single precision mantissa multiplication. It is 
specified as supporting an Extended Scientific Notation, 
which forces the implementation of dual accumulators. 
Alignment of a product is to any bit boundary, so that 65 
weights of every power of two must be supported. Trunca- 
tion of "dropped bits** in either the accumulator or partial Table 13 of C1-C8 for digits 0 to 30: 



B[i + 1] B[i] 


Bti-1] 


Opera- 
tion 


Remarks 


0 0 


0 


40 


String of 0's 


0 0 


1 


+A 


String of l's terminating at B[iJ 


0 1 


0 


+A 


Solitary 1 at B{i] 


0 1 


1 


+2A 


String of 1*8 terminating at B[i + 1] 


1 0 


0 


-2A 


String of l's starting at B[l + 1] 


1 0 


1 


-A 


String of l's terminating at B[i] 








plus String of l's starting at B[i + 1] 


1 1 


0 


-A 


String of l's starting at B[i] 


1 1 


1 


-0 


String of l's traversing all examined 








bits of B 
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TABLE 13 
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ABc 
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AfaRc 


AduBe 
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ABa 
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AlsBa 


AduBc 


AbuBc 


25 
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AB8 








24 


g 
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g 


A&B8 


AdiiRa 


Abu Be 


A9uBe 


23 


g 


g 


Q 


AB6 




AcuBa 




A8uBe 


22 


0 


0 


0 


AisB6 


AduBS 


AL..D* 

AOUDa 


ASniBc 


A7uBc 


21 


0 


0 


AB4 


AeuB6 


AcuB8 


AauBa 


ASuBc 




20 


0 


0 


A£sB4 


AduB6 


AbuB8 


A9uBa 




ASuBc 


19 


0 


AB2 


AeuB4 


AcuB6 


AauB8 


A8uBa 


AouBc 






0 


AfaB2 


AduB4 


AbuB6 


A9uB8 


A7uBa 


A5uBc 


A3uBc 


17 


ABO 


AcuB2 


AcuB4 


AauB6 


ASuBS 


A6uBa 


A4uBc 


AVZUIjC 


i s 

ID 


t\ISD*J 


AduB2 


AbuB4 


A9uB6 


A 'Tit no 


A5uBa 


AJUDC 


AluBe 


15 


AeuBO 


AcuB2 


AauB4 


A8aB6 


A6uB8 


A4uBa 


A2uBc 


AOuBc 


14 


AduBO 


AbuB2 


A9uB4 


A7uB6 


A5uB8 


A3uBa 


AluBc 


0 


13 


AcaBO 


AauB2 


A8uB4 


A6uB6 


A4uB8 


A2uBa 


AOuBc 
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12 


AbuBO 


A9uB2 


A7uB4 


A5uB6 


A3uB8 


AluBa 


0 


0 


11 


AauBO 


A8uB2 


A6uB4 


A4uB6 


A2uB8 


AOuBa 


0 


0 


10 


A9uB0 


A7UB2 


A5uB4 


A3uB6 


AluBS 


0 


0 


0 


9 


ASuBO 


A6uB2 


A4uB4 


A2uB6 


AOuBS 
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0 
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ATuBO 


A5uB2 


A3uB4 


AluB6 


0 


0 


0 


0 


7 


A6uB0 


A4uB2 


A2uB4 


A0uB6 
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0 


0 
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6 


A5uB0 


A3uB2 


AluB4 


0 


0 


0 


0 


0 


5 


A4uB0 


A2uB2 


A0uB4 


0 


0 


0 


0 


4 




A3uB0 


AJuB2 


0 


0 


0 


0 


0 


0 


3 


A2uB0 


A0uB2 


0 


0 


0 


0 


0 


0 


2 


AluBO 


0 


0 


0 


0 


0 


0 


0 


1 


AOuBO 


0 


0 


0 


0 


0 


0 


0 


0 



Implementation Parameters to achieve various requirements 
are summarized in the following table 14 that illustrates 
performance evaluation with (3,2) Booth Encoder Small Bit 
Multipliers Cells is shown in the following table of Capa- 
bility versus size comparison (N-16) based upon FIG. 1. 
The typical adder cell count in this table is based upon using 
a 3-2 bit Modified Booth Coding scheme similar in Table 12. 



TABLE 14 
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Cyc 


Typical 


Typical 
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Start 
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Adder 


Register 






Aoc 


ment 


Adder 
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start 


Cell 


Bit 




Operation 


Bits 


Slots 


Cells 


Bits 


End 


next 


Count 


Count 


Remarks 


Mul 16*16 


56 


2 


205 


148 


2 




128 


112 


Allows 2 24 accumulations 
Note 1 


Mul 16*32 










3 


2 


256 


128 


Allows 2 s accumulations 


Mul 16*16 


64 


3 


213 


156 


2 


1 


128 


128 


Allows 2 32 accumulations 
Note 2 


Mul 16*32 










3 


2 


256 


128 


Allows 2 ltf accumulations 


Mul 32*32 










6 


4 


512 


128 


Allows 1 operation 


Mul 16*16 


72 


4 


221 


164 


3 


1 


128 


144 


Allows 2 40 accumulations 
Note 3 


Mul 16*32 










4 


2 


256 


144 


Allows 2 24 accumulations 


Mul 32*32 










6 


4 


512 


144 


Allows 2 s accumulations 


Mul 32*48 










8 


6 


768 


144 


Allows 2 s accumulations 



standard equivalent device and would have the same 
throughput as the standard implementation. 
Alignment in this multiplier-accumulator is the same as 
multiplying the product by 1, 2 16 «65536 and (2 16 ) 2 . It 
is functionally equivalent to a 32 by 32 bit multiplier 
with follow-on local carry propagate adder for accu- 
mulation. The equivalent circuit would require 512 



Notes: 

Alignment in this multiplier-accumulator is the same as ^ 
multiplying the product by 1 and 2 16 «65536. It is 
functionally equivalent to a 16 by 32. bit multiplier 
with follow-on local carry propagate adder for accu- 
mulation. The equivalent circuit would require 256 
adder cells and 128 bits of accumulator memory com- 
pared to 205 adder cells and 148 bits of memory. It 65 
would have about the same amount of logic circuitry. 
Its clock cycle time is approximately half that of the 



adder cells and 128 bits of accumulator memory com- 
pared to 213 adder cells and 156 bits of memory. It 
would be about half the logic circuitry. Its clock cycle 
time is approximately half that of the standard equiva- 
lent device. 

It would take twice as long to perform a 32 by 32 bit 
multiply. The multiplier- accumulator would be twice as fast 
the standard circuit for 16 by 16 multiplication. It would 
perform a 16 by 32 bit multiplication at the same rate as the 
standard multiplier-accumulator would perform. 
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Alignment is the same as multiplying the product by 1, 
2 16 «65536, (2 16 ) 2 and (2 ,e ) 3 . It is functionally equivalent to 
a 32 by 48 bit multiplier with follow-on local carry propa- 
gate adder for accumulation. The equivalent circuit would 
require 768 adder cells and 144 bits of accumulator memory 5 
compared to 221 adder cells and 164 bits of memory. It 
would be about a third the logic circuitry. Its clock cycle 
time is approximately half that of the standard equivalent 
device. It would take three times as long to perform a 32 by 
48 bit multiply. The present multiplier-accumulator would 10 
be twice as fast the the standard circuit for 16 by 16 
multiplication. It would perform a 16 by 32 bit multiplica- 
tion at the same rate as the standard circuit would perform. 
It would perform a 32 by 32 bit multiplication in about twice 
as long as the standard circuit. 15 

The following table 15 illustrates a Capability versus size 
comparison (N-24) based upon FIG. 17. The typical adder 
cell count in this table is based upon using a 3-2 bit Modified 
Booth Coding scheme similar in Table 12. 
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a 32 by 48 bit multiplier with follow-on local carry 
propagate adder for accumulation. The equivalent cir- 
cuit would require 768 adder cells and 176 bits of 
accumulator memory compared to 303 adder cells and 
212 bits of memory. It would have about half as much 
logic circuitry. Its clock cycle time would be somewhat 
less than half the standard implementation. It would 
take 4 new circuit clock cycles to perform what would 
take 1 standard clock cycle (or 2 new circuit clock 
cycles) in the new circuit to perform. 
However, in one clock cycle, a 16 by 24 bit multiplication 
could occur and in two clock cycles either a 16 by 48 
or a 32 by 24 bit multiplication could occur. This circuit 
is half the size and for a number of important DSP 
arithmetic operations, either as fast or significantly 
faster than a standard circuit with the same capability. 
Multiplier as a 24 by N multiplier-accumulator (N>-24) 
Use ot a Modified 4-3 bit Booth Multiplication Coding 
Scheme 



TABLE 15 













Cyc 


Cyc 


Typical 


Typical 
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Start 
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Adder 
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Aoc 


ment 


Adder 


HI 


to 


start 
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Bit 




Operation 


Bits 


Slots 
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Bits 


End 


next 


Count 


Count 


Remarks 


Mul 16*24 


64 


2 


283 


196 


3 


1 


256 


128 


Allows 2 16 accumulations 
Note 1 


Mul 32*24 










4 


2 


448 


128 


Allows 2 s accumulations 


Mul 16*24 


88 


4 


303 


212 


3 


1 


280 


176 


Allrrtro ?■** »rrnrrnilntinn« 

Note 2 


Mul 32*24 










4 


2 


472 


176 


Alio™" 7 32 »«nnml»Hnim 


Mul 16*48 










5 


2 


465 


176 


Allows Z 2 * accumulations 


Mul 32*48 










6 


4 


768 


176 


Allows 2 s accumulations 
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Notes: 

Alignment is the same as multiplying the product by 1 and 
2 2 *-(2 8 ) 3 . It is functionally equivalent to a 32 by 24 bit 
multiplier with follow-on local carry propagate adder 
for accumulation. The equivalent circuit would require 40 
256 adder cells and 128 bits of accumulator memory 
compared to 205 adder cells and 148 bits of memory. 
It would have about the same amount of logic circuitry. 
Its clock cycle time is approximately half that of the 
standard equivalent device and would have the same 45 
throughput as the standard implementation. 

Alignment is the same as multiplying the product by 1, 
2**, 2 16 and 2 40 -2 lfr * 2 \ It is functionally equivalent to 



This embodiment primarily differs from its predecessors 
in the multiplier block 300. As before, a version of Booth's 
Algorithm is used to minimize the number of add operations 
needed. The following algorithm is based upon examining 
four successive bits, determining whether to perform an add 
or subtract, then processing over three bit positions and 
repeating the process. This is what has lead to the term 4-3 
bit coding scheme. There is a 1-bit overlap, the least 
significant bit of one examination is the most significant bit 
of its successor examination. 

Table 16 illustrates a Modified 4-3 Bit Booth Multiplica- 
tion Coding Scheme: 



TABLE 16 

13(i + 2] B[i + 1] B[i] B[i - 1] Operation Remark 
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0 


0 


0 


+0 


0 


0 


0 


1 


+A 


0 


0 


1 


0 


+A 


0 


0 


1 


1 


+2A 


0 


1 


0 


0 


+2A 


0 


1 


0 


1 


+3A 


0 


1 


1 


0 


+3A 


0 


1 


1 


1 


+4A 


1 


0 


0 


0 


-4A 


1 


0 


0 


1 


-3A 


1 


0 


1 


0 


-3A 


1 


0 


1 


1 


-2A 



string of O's 

string of 1 's terminating at Bfi] 
Solitary 1 at B[i] 

sting of I's terminating at Bfi + 1] 

Solitary 1 at Bfi + 1) 

String of t's terminating at B[i] 

plus solitary 1 at B[i + 1] 

Short string(-3) at B[i + 1] and B[i] 

String of l's terminating at Bfi + 2] 

String of l's starting at Bfi + 2] 

String of 1 'starting at B[i + 2] 

plus string of l's terminating at B[i] 

String of l's starting at B(i + 2] 

plus solitary 1 at B[i] 

String of l's starting at BJi + 2] 



43 
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TABLE 16-C0Dtinued 



Bfi + 2] B[i+1] 


Bfil 


Bfi - 


1] Operation Remark 








plus string of l's terminating at Bfi + 1] 


1 1 


0 


0 


-2A String of l's starting at Bfi + 1] 


1 1 


0 


1 


-A Siring of l's starting at B[i + 1] 








plus string of l's terminating at B(iJ 


1 1 


1 


0 


-A String of l's starting at Bfi] 


1 1 


1 


1 


-0 String of l's starting traversing all bits 



Optimal Double Precision Floating Point Mantissa Mul- 
tiplication 

An implementation based upon 24- by 32-bit multiplica- 
tion would be capable of performing a standard 56-bit 
precision floating point mantissa multiplication every two 
cycles. The 56-bit length comes from the inherent require- 
ment of IEEE Standard Double Precision numbers, which 
require a mantissa of 64-10 bits, plus two guard bits for 
intermediate rounding accuracy. Such an implementation 
would require only two alignment slots. An implementation 



of 16- by 24-bit multiplication would be capable of sup- 
porting the 56-bit floating point mantissa calculation, but 
15 with the liability of taking more clock cycles to complete. 
More alignment slots would be required. Such an imple- 
mentation would however much less logic circuitry as the 
application dedicated multiplier. Implementation of a p-adic 
mantissa for either p-3 or 7 would be readily optimized in 
such implementations. 

Table 17 of C1-C8 for digits 0 to 47 



TABLE 17 



Cl 


C2 


C3 


C4 


cs 


C6 


C7 


C8 


Digit k 


0 


0 


0 


0 


0 


U 


u 


AB15 


47 


o 


Q 


o 


o 


Q 


o 


o 


A19uRl5 


46 


o 


o 


o 


o 


o 


o 


Q 


AJ8uB15 


45 


o 


o 


o 


o 


o 


0 


AB12 


A17uBl5 


44 


0 


0 


0 


0 


0 


0 


A19uB12 


A16uB15 


43 


0 


0 


0 


0 


0 


0 


A18uB12 


Al5uB15 


42 


0 


0 


0 


0 


0 


ABf 


A17uB12 


A14uB15 


41 


0 


0 


0 


0 


0 


A19uBf 


A16uB12 


A13uB15 


40 


0 


0 


0 


0 


0 


AlSuBf 


A15uB12 


A12oB15 


39 


0 


0 


0 


0 


ABc 


A17uBf 


A14uB12 


AlloB15 


38 


0 


0 


0 


0 


A19oBc 


AlouBf 


A13uB12 


A10aB15 


37 


0 


0 


0 


0 


A18uBc 


A15uBf 


A12uB12 


A&B15 


36 


0 


0 


0 


AB9 


AlTuBc 


A14uBf 


AlluB12 


AcuB15 


3S 


0 


0 


0 


A19uB9 


A16uBc 


A13uBf 


A10uB12 


AduBIS 


34 


0 


0 


0 


A18uB9 


AlSuBc 


A12uBf 


A&BI2 


AcuBl5 


33 


0 


0 


AB6 


AJ7uB9 


A14uBc 


AlluBf 


AeuB12 


AbuBIS 


32 


0 


0 


A19uB6 


AJ6uB9 


A13uBc 


AlOuBf 


Adufil2 


AauBIS 


31 


0 


0 


A18uB6 


AJ5uB9 


A12uBc 


AfcBf 


AcuB12 


A9uB15 


30 


0 


AB3 


A17uB6 


A14uB9 


AlluBc 


AeuBf 


AbuB12 


A8uB15 


29 


0 


A19uB3 


AJ6uB6 


A13uB9 


AlOuBc 


AduBf 


AauBI2 


ATuBIS 


28 


0 


A18uB3 


A15uB6 


A12uB9 


AfisBc 


AcnBf 


A9uB12 


A6uB15 


27 


ABO 


A17uB3 


A14oB6 


AlluB9 


AcaBc 


AboBf 


A8uB12 


A5uB15 


26 


A19bB0 


A16uB3 


A13uB6 


A10uB9 


AduBc 


AmiBf 


A7oB12 


A4oB15 


25 


A18sB0 


AlSuB3 


A12uB6 


A&B9 


AcuBc 


A9oBf 


A6uB12 


A3uB15 


24 


A17sB0 


AJ4uB3 


AlluB6 


AeuB9 


AbuBc 


ASuBf 


A5uB12 


A2uB15 


23 


A16sB0 


A13uB3 


A10uB6 


AduB9 


AauBc 


ATuBf 


A4uB12 


AluBlS 


22 


A15sB0 


A12ufi3 


A£sB6 


AcaB9 


A9uBc 


AouBf 


A3uB12 


AOuBlS 


21 


A14eB0 


AlluB3 


AeuB6 


AbuB9 


ASuBc 


ASuBf 


A2uB12 


0 


20 


A13sB0 


A10uB3 


AduB6 


AauB9 


A7uBc 


A4uBf 


AluB12 


0 


19 


A12&B0 


A&B3 


AcuB6 


A9uB9 


AouBc 


A3uBf 


A0uB12 


0 


18 


AllsBO 


AeuB3 


AbuB6 


A8uB9 


A5ufic 


A2uBf 


0 


0 


17 


AlOsBO 


AduB3 


AauB6 


A7uB9 


A4uBc 


AluBf 


0 


0 


16 


AfsBO 


AcuB3 


A9uB6 


A6uB9 


A3uBc 


AOuBf 


0 


0 


15 


AeufiO 


AbuB3 


A8uB6 


A5uB9 


A2uBc 


0 


0 


0 


14 


AcluBO 


AauB3 


A7uB6 


A4uB9 


AluBc 


0 


0 


0 


13 


AcuBO 


A9uB3 


A6uB6 


A3uB9 


AOuBc 


0 


0 


0 


12 


AbufiO 


A8uB3 


A5uB6 


A2uB9 


0 


0 


0 


0 


11 


AauBO 


A7uB3 


A4uB6 


AluB9 


0 


0 


0 


0 


10 


A9uB0 


A6uB3 


A3uB6 


A0uB9 


0 


0 


0 


0 


9 


A8uB0 


A5uB3 


A2uB6 


0 


0 


0 


0 


0 


8 


A7uB0 


A4uB3 


AluB6 


0 


0 


0 


0 


0 


7 


AGuBO 


A3uB3 


A0uB6 


0 


0 


0 


0 


0 


6 


A5uB0 


A2uB3 


0 


0 


0 


0 


0 


0 


5 


A4uB0 


AluB3 


0 


0 


0 


0 


0 


0 


4 


A3uB0 


A0uB3 


0 


0 


0 


0 


0 


0 


3 


A2uB0 


0 


0 


0 


0 


0 


0 


0 


2 


AluBO 


0 


0 


0 


0 


0 


0 


0 


1 


AOuBO 


0 


0 


0 


0 


0 


0 


0 


0 
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45 

The following table 18 illustrates the performance evalu- 
ation of Capability versus size comparison (N«24) based 
upon FIG. 17. The typical adder cell counts in the above 
table are based upon a multiplier design using a 4-3 bit 
Modified Booth Encoding Algorithm. 



Bl 

46 

ucl is to any bit boundary, so that weights of every 
power of two must be supported. Truncation of 
"dropped bits" in either the accumulator or partial 
product circuitry require Gl to be able to mask digits. 
Integer performance is the same as that described in the 



TABLE 18 













Cyc 


Cyc 


Typical 


Typical 








Align- 




El + 


Start 


to 


Adder 


Register 






Acc 


ment 


Adder 


HI 


to 


start 


Cell 


Bit 




Operation 


Bils 


Slots 


Cells 


Bits 


End 


next 


Count 


Count 


Remarks 


Mul 24*24 


56 


1 


272 


244 


3 


1 


272 


112 


Allows 2* accumulations 
Note 1 


Mul 24*24 


80 


2 


296 


292 


3 


1 


296 


160 


Allows 2 32 accumulations 
Note 2 


Mul 24*48 










4 


2 


512 


160 


Allows 2 8 accumulations 


Mul 24*24 


64 


64 


280 


260 


3 


1 


576 


256 


Allc*"* 7 lS »rrnmiil*r»rm«i 

Note 3 


FMul 24*24 










33 


12 




256 


Allows indefinite number of 
accumulations Allows 2 s 
accumulations 


Mul 24*24 


48 


16 


264 


260 


3 


1 


576 


192 


Allows 1 operation 


P-adic 


















Note 4 


P-adic FMul 










3 


1 




192 


Allows indefinite number of 


24*24 


















accumulations 



Note: 

The primary advantage of this circuit is that it performs 
twice as many mul tiply-accumul ales in the same period 
of time as the standard implementation. It is somewhat ^ 
larger, due to the memory bits in the El circuit. 

Alignment in this new circuit is the same as multiplying 
the product by 1 and 2 24 ^2 8 ) 3 . It is functionally 
equivalent to a 24 by 48 bit multiplier with follow-on 
local carry propagate adder for accumulation. The 35 
equivalent circuit would require 512 adder cells and 
160 bits of accumulator memory compared to 296 
adder cells and 292 bits of memory. It would have 
about 60% as much logic circuitry. Its clock cycle time 
is approximately half that of the standard equivalent ^ 
device. The new circuit would have the same through- 
put as the standard implementation for 24 by 48 bit 
multiplications, but for 24 by 24 bit multiplications, 
would perform twice as fast 

This circuit is capable of performing single precision 45 
mantissa multiplication. It is specified as supporting an 
Extended Scientific Notation, which forces the imple- 
mentation of dual accumulators. Alignment of a prod- 



previous note. Note that the present multiplier- 
accumulator can support a new single precision floating 
point multiplication-accumulation every clock cycle. 

This is the first circuit discussed in this patent capable of 
p-adic floating point support; P»7. Since alignment is at 
p-digit boundaries, a 48 bit (which is 16 p-digits) 
accumulator only requires 16 alignment slots, making 
its implementation of the alignment mechanism much 
less demanding. The adder cells used here are p-adic 
adder cells, which are assuming to work on each of the 
three bits of a redundant p-digit notation. These adder 
cells may well be different for each bit within a digit, 
but wuT be counted as having the same overall com- 
plexity in this discussion. The primary advantage of 
this circuit is that its performance is twice the perfor- 
mance of the standard implementation. 

Multiplier as 16 by N using a 4-3 Booth Coding Scheme 
in FIG. 18 

Multiplier 300 circuitry 

Table 19 illustrates coefficient generation for multipliers 
300: 



TABLE 19 



ci 


C2 


C3 


C4 


C5 


C6 


C7 


C8 


Digit 


0 


0 


0 


0 


0 


ABf 


Zlf 


0 


31 


0 


0 


0 


0 


0 


AfsBf 


Zle 


0 


30 


0 


0 


0 


0 


0 


AeuBf 


Zld 


0 


29 


0 


0 


0 


0 


ABc 


AduBf 


Zlc 


0 


28 


0 


0 


0 


0 


AfkBc 


AcuBf 


Zlb 


0 


27 


0 


0 


0 


0 


AeuBc 


AbuBf 


Zla 


0 


26 


0 


0 


0 


AB9 


AduBc 


AauBf 


Z19 


0 


25 


0 


0 


0 


AfsB9 


AcuBc 


A9uBf 


Z18 


0 


24 


0 


0 


0 


AeuB9 


Abu Be 


A8uBf 


7.17 


0 


23 


0 


0 


AB6 


AduB9 


AauBc 


A7uBf 


Z16 


0 


22 


0 


0 


A£sB6 


AcuB9 


A9uBc 


A6uBf 


Z15 


0 


21 


0 


0 


AeuB6 


AbuB9 


ASuBc 


ASuBf 


Z14 


0 


20 


0 


AB3 


AduB6 


AauB9 


ATuBc 


A4uBf 


Z13 


0 


19 


0 


AfsB3 


AcluB6 


A9uB9 


A6uBc 


A3uBf 


Z12 


0 


18 


0 


AeuB3 


Abu Bo 


A8uB9 


A5uBc 


A2ul3f 


Zll 


0 


17 


ABO 


AduB3 


AauB6 


A7uB9 


A4uBc 


AluBf 


Z10 


0 


16 
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TABLE 19-coDtinued 



CI 


C2 


C3 


C4 


C5 


C6 


C7 


C8 


Digit k 


AfsBO 


AcuB3 


A9uB6 


A6uB9 


A3uBc 


AOuBf 


Zf 


0 


15 


AeuBO 


AbuB3 


A8uB6 


A5uB9 


A2uBc 


0 


Ze 


0 


14 


AduBO 


AauB3 


A7uB6 


A4uB9 


AluBc 


0 


Zd 


0 


13 


AcuBO 


A9uB3 


A6uB6 


A3uB9 


AOuBc 


o 


Zc 


o 


12 


AbuBO 


A&iB3 


A5uB6 


A2uB9 


0 


0 


Zb 


0 


11 


AauBO 


A7uB3 


A4uB6 


AJuB9 


0 


0 


Za 


0 


10 


A9uB0 


A6uB3 


A3uB6 


A0uB9 


0 


0 


Z9 


0 


9 


A&ifiO 


A5uB3 


A2uB6 


0 


0 


0 


28 


0 


8 


ATuBO 


A4uB3 


AluB6 


0 


0 


0 


Z7 


0 


7 


A6uB0 


A3uB3 


A0uB6 


0 


0 


0 


Z6 


0 


6 


A5ufi0 


A2uB3 


0 


0 


0 


0 


Z5 


0 


5 


A4uB0 


AluB3 


0 


0 


0 


0 


Z4 


0 


4 


A3uB0 


A0oB3 


0 


0 


0 


0 


23 


0 


3 


A2uB0 


0 


0 


0 


0 


0 


22 


0 


2 


AluBO 


0 


0 


0 


0 


0 


Zl 


0 


1 


AOuBO 


0 


0 


0 


0 


0 


zo 


0 


0 



Trimmed Adder Tree Requirements 

Examination of Table 19 shows that Adder D4 is not 
needed to achieve a fixed point polynomial step implemen- 
tation. Adder D4 and D6 would be unnecessary for imple- 
mentations which did not support single cycle polynomial 
step operations. 

Implementation of polynomial step operations 

Fixed point arithmetic polynomial step calculations 
would not need Adder D4. 

The assumption would be that the computation's preci- 
sion would match or be less than N bits, so that the Z input 
in this case would be 16 bits, which would be aligned to the 
most significant bits of the product Integer arithmetic poly- 
nomial step calculations would also not need Adder D4. The 
major difference would be that the offset in such a situation 
would be assumed to be of the same precision as the result 
of the multiplication, so that Z would be assumed to be 32 
bits. 

Table 20 illustrates Performance versus Size for N-16. 



standard circuit and the same performance for 16 by 32 
bit multiplies. 

This new circuit has alignment weights of 1, 2 16 and 
2 32 *<2 ie ) 2 . It possesses about half of the logic of a 
standard implementation. It performs one 32 by 32 bit 
multiply in 4 of its clock cycles, compared to the 
standard implementation taking about 2 new circuit 
clock cycles. 

However, it performs a 16 by 16 bit multiply every clock 
cycle, which is twice as fast as the standard implementation. 

This new circuit has alignment weights of 1, 2 16 , 2 32 - 
(2 16 ) 2 and 2 48 «(2 16 ) 3 . It possesses about a third of the 
logic of a standard implementation. It performs one 32 
by 48 bit multiply in 6 of its clock cycles, compared to 
the standard implementation taking about 2 new circuit 
clock cycles. However, it performs a 16 by 16 bit 
multiply every clock cycle, which is twice as fast as the 
standard implementation. 

The basic difference in the MAC of FIG. 20 and the above 
MAC of FIG. 19 is that there are an additional four numbers 



TABLE 20 













Cyc 


Cyc 


Typical 


Typical 






Align- 




El + 


Start 


to 


Adder 


Register 




Ace 


ment 


Adder 


HI 


to 


start 


Cell 


Bit 


Operation 


Bits 


Slots 


Cells 


Bits 


End 


next 


Count 


Count Remarks 


MuJ 16*16 


40 


1 


148 


132 


2 


1 


196 


80 Allows 2* accumulations 
Note 1 


Mul 16*16 


56 


2 


196 


148 


2 


1 


196 


112 Allows 2 1 * Bccurmilaiiom 
Note 2 


Mul 16*32 










3 


2 


300 


112 Allows 2* accumulations 


Mul 16*16 


64 


3 


220 


156 


2 


1 


220 


128 Allows 2 32 accumulations 
Note 3 


Mul 16*32 










3 


2 


316 


128 Allows 2 10 accumulations 


Mul 32*32 










5 


4 


600 


144 Allows 2 s accumulations 


Mul 16*16 


88 


4 


270 


196 


2 


1 


270 


176 Allows 2 s * accumulations 
Note 4 


Mul 16*32 










3 


2 


374 


176 Allows 2*° accumulations 


Mul 32*32 










5 


4 


648 


176 Allows 2" accumulations 


Mul 32*48 










8 


6 


900 


176 Allows 2 B accumulations 



Notes: 60 generated in multiplier block 300, 0-02. This requires six 

This circuit has as its major advantage being able to holders DI-D6 on the output The Adders D5 and D6 extend 

perform twice as many multiply-accumulates in the the precision of the multiplication which can be accom- 

same time as a standard implementation. plisbed by 50% beyond that which can be achieved by a 

Alignment weights are the same as multiplying by 1 and comparable circuit of the basic Multiplier described above. 

2 16 . This circuit has about 70% of the standard multi- 65 A 32 bit by N bit single cycle multiplication could be 

plier circuit capable of the same operations. It has twice achieved without the necessity of D6. In such an 

the performance for 16 by 16 bit multiplies as the implementation, D6 would provide the capability to imple- 
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meat a polynomial step operation of the form X*Y+Z, where 
X and Z are input numbers and Y is the state of an 
accumulator register contained in HI. This would be 
achieved in a manner similar to that discussed regarding 
FIGS. 18 and 19. Such an implementation would require at 
least two accumulator registers in HI for optimal perfor- 
mance. If N>-32, then with the appropriate alignment slots 
in Gl and G2, these operations could support multiple 
precision integer calculations. Such operations are used in 
commercial symbolic computation packages, including 
Mathematics, Macsyma, and MAPLE V, among others. 

An implementation of 28 by N bit multiplication would be 
sufficient with the use of D6 to provide offset additions 
supporting two cycle X*Y+Z polynomial step calculation 



50 



store the result. The primary performance improvement 
comes from being able to handle more bits in parallel in one 
clock cycle. The secondary performance improvement 
comes from being able to start a second operation while the 

5 first operation has traversed only about half the adder tree as 
in the primary circuitry discussion. The third performance 
improvement comes from the ability to perform multiple- 
precision calculations without significantly affecting the size 

10 of the circuit. An implementation based upon this diagram 
with a trimmed adder tree can support 32 by N bit multiply- 
accumulates. 

Table 21 illustrates a Trimmed adder tree supporting 32 by 
32 Multiplication (Performance versus Size for N=32). 



TABLE 21 













Cyc 


Cyc 


typical 


Typical 








Align- 




El + 


Start 


to 


Adder 


Register 






Acc 


ment 


Adnei 


HI 


to 


start 


Cell 


Bit 




Operation 


Bits 


Slots 


Cells 


Bits 


End 


next 


Count 


Count 


Remarks 


Mul 32*32 


80 


1 


508 


400 


2 


1 


508 


160 


Allows 2 16 accumulations 
Note 1 


Mul 32*32 


112 


2 


572 


464 


2 


1 


572 


224 


Allows 2 s * accumulations 
Note 2 


Mul 32*64 










3 


2 


860 


224 


Allows 2 10 accumulations 


Mul 32*32 


144 


3 


636 


528 


2 


1 


636 


288 


Allows 2 80 accumulations 
Note 3 


Mul 32*64 










3 


2 


924 


288 


Allows 2 48 accumulations 


Mul 64*64 










5 


4 


1664 


288 


Allows 2 l * accumulations 


Mul 32*32 


160 


4 


672 


560 


2 


1 


668 


320 


Allows 2 s0 accumulations 
Note 4 


Mul 32*64 










3 


<2 


960 


320 


Allows 2 40 accumulations - . 


Mul 64*64 










5 


4 


1694 


320 


Allows 2 10 accumulations 


Mul 64*96 










8 


6 


2176 


320 


Allows 2 s accumulations 



support for Standard Double Precision Floating Point man- 
tissa calculations. 

Implementations of either of the last two implementations 
which contained four accumulation registers in HI would be 
capable of supporting Extended Precision Floating Point ^ 
Mantissa Multiplication/Accumulations acting upon two 
complex numbers, which is a requirement for FORTRAN 
runtime environments. Any of the above-discussed imple- 
mentations could be built with the capability of supporting 
p-adic floating point operations of either Standard or 
Extended Precision Floating Point, given the above discus- 45 
sion. Adder chains D7, D8 and D9 are provided on the 
output of Adders D1-D6 in a true configuration. These 
Adder chains D7, D8 and D9 take as inputs the results of Dl, 
D2, D3, D4, D5 and D6, respectively. The primary Multi- 
plier does not contain D9. It is specific to the embodiment 50 
discussed herein. 

As in the initial Multiplier/Accumulator architecture of 
FIG. 17, the inputs of Adder D10 are the results of Adders 
D7 and D8, which have been registered in Block El. Adder 
DU takes as inputs the aligned results of Adder D9 and 55 
aligned results of selected memory contents of HI. In this 
embodiment to the Basic Multiplier/ Accumulator Architec- 
ture. Adder Dll takes as inputs the aligned results of Adder 
D9 and aligned results of selected memory contents of HI. 
The alignment mentions in the last sentence is performed by 60 
Gl. The aligned results of Adder D9 have traversed El, 
where they synchronously captured. 

Adder D12 receives the aligned results of the Adders D10 
and the results of Adder Dll. G2 aligns the results of Adder 
D10 prior to input of this aligned signal bundle by Adder 65 
D12. The results of its operation are sent to Block HI, where 
one or more of the registers^) internal to Block HI may 



Notes: 

This circuit performs twice as many multiply- 
accumulates in the same time as a standard implemen- 
tation. 

Alignment weights for this circuit are the same as mul- 
tiplying by 1 and 2 32 . This circuit has about 70% of the 
standard multiplier circuit capable of the same opera- 
tions. It has twice the performance for 32 by 32 bit 
multiplies as the standard circuit and the same perfor- 
mance for 32 by 64 bit multiplies. 

This circuit has alignment weights of 1, 2 32 and 2 64 ^2 32 ) 
2 . It possesses less than half of the logic of a standard 
implementation. It performs one 64 by 64 bit multiply in 4 
of its clock cycles, compared to the standard implementation 
taking about two circuit clock cycles. 

However, it performs a 32 by 32 bit multiply every clock 
cycle, which is twice as fast as the standard implementation. 

This circuit has alignment weights of 1, 2 32, M4 «(2 32 ) 2 
and 2 96 -(2 32 ) 3 . It possesses about a third of the logic of a 
standard implementation. It performs one 64 by 96 bit 
multiply in 6 of its clock cycles, compared to the standard 
implementation taking about two circuit clock cycles. 
However, it performs a 32 by 32 bit multiply every clock 
cycle, which is twice as fast as the standard implementation. 

Referring now to FIGS. 21 and 22, there are illustrated 
two additional embodiments of the MAC 68. Both of these 
FIGS. 21 and 22 support single-cycle double precision 
floating point mantissa multiplications. They may be imple- 
mented to support Extended Scientific Floating Point Nota- 
tions as well as p-adic floating point and extended floating 
point with the same level of performance. FIG. 21 represents 
a basic multiplier-accumulator. FIG. 22 represents an 
extended circuit which supports optimal polynomial calcu- 
lation steps. 
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Use of 4-3 Modified Boolh Muliiplication Encoding will 
be assumed for multiplier block 300. The support of small 
p-adic floating point mantissa or Modular Arithmetic mul- 
tiplication would require a modification of this scheme. The 
18 partial products which are generated support the 54 bit 
mantissa fields of both standard double precision and also 
p-7 p-adic double precision. These FIGS. 21 and 22 repre- 
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second operation while the first operation has traversed only 
about half the adder tree as in the primary circuitry discus- 
sion. 

he following Table 22 describes the performance analysis 
of Multipliers with two accumulators capable of supporting 
Extended Scientific Double Precision Standard and p=7 
p-adic multiplication-accumulation on every cycle. 



TABLE 22 













Cyc 


Cyc 


Typical 


Typical 






Acc 


Align- 




El + 


Start 


to 


Adder 


Register 






(2) 


ment 


Adder 


HI 


to 


start 


an 


Bit 




Operation 


Bits 


Slots 


Cells 


Bits 


End 


next 


Count 


Count 


Remarks 


FMul 


256 


128 


475(3) 


932 


2 


1 


475(3) 


512 


Note 1 


54-54 






338(2) 








338(2) 






PFMul 


216 


36 


475(3) 


812 


2 


1 


475(3) 


432 


Note2 


18*18 






298(2) 








298(2) 







sent circuitry thus capable of 54 by 54 bit standard mantissa 
multiplication as well as 18 by 18 digit (54 bits) p-adic 
mantissa calculation. 

Starting from the left, the first layer of adders (D1-D6) on 
the output of multiplier block 300 and the third layer of 25 
adders (D10) on the output of pipeline registers El are the 
sum of three-number adder chains. The second and fourth 
layers of adders (D7-9 and DU) are the sum of two number 
adders. The alignment circuitry Gl and the use of an adder 
ring in Dll provide the alignment capabilities needed for the 30 
specific floating point notations required. Circuitry in HI 
may be implemented to support Extended Scientific Nota- 
tions as well as optimize performance requirements Cor 
Complex Number processing for FORTRAN. The functions 
performed by Jl are not substantially different from the 35 
above-noted embodiments. 

With further reference to FIG. 21, the major item to note 
is that there are an additional six numbers generated in 
multiplier block 300 beyond what FIG. 20 could generate. 
The Adders Dl to D6 each add three numbers represented by 40 
the signal bundles CI to C18. Standard, as well as p=7 
p-adic, floating point double precision mantissa multiplica- 
tions require 54 bit (18 p=7 p-adic digit) mantissas. This 
multiplier block 300 would be able to perform all the small 
bit multiplications in parallel. The results of these small bit 45 
multiplications would then be sent to Adders Dl to D6 to 
create larger partial products. 

The adder chains D7, D8 and D9 take as inputs the results 
of Dl, D2, D3, D4, D5 and D6, respectively. The primary 
Multiplier claimed does not contain D9. It is specific to the 50 
embodiment being discussed here. Adder D10 also sums 
three numbers. The inputs of Adder D 10 are the results of 
Adders D7, D8 and D9, which have been registered in Block 
El. Adder Dll receives the aligned results of the Adders 
D10 and the selected contents of HI. Gl align:, the results 55 
of Adder D10. The results of its operation are sent to Block 
HI, where one or more of the registers^) internal to Block 
HI may store the result. 

Register Block HI and Interface Jl have an additional 
function in FIG. 22: The ability to be loaded with an 60 
additional number M Y" which may then be used to compute 
B*Z+Y. The primary performance improvement comes from 
being able to handle a double precision mantissa multipli- 
cation every clock cycle with the necessary accumulators to 
support Extended Scientific Precision Floating Point for 65 
either standard or p=7 p-adic arithmetic. The secondary 
performance improvement comes from being able to start a 



Note: 

This design implements standard double precision man- 
tissa multiplication-accumulate targeting extended sci- 
entific notation accumulators. 

This notation requires dual accumulators of twice the 
length of the mantissa. Minimally, 108 alignment slots 
would be sufficient. For simplicity of design, the align- 
ment slots are made a power of two. This drives the 
requirement of accumulators holding 128 bits in the 
redundant binary notation. Note that complex number 
support would double the number of accumulators 
required. Such support is needed for FORTRAN and 
optimal for Digital Signal Processing applications 
based upon complex number arithmetic. 

The number of adder cells is decomposed into two types: 
those which sum 3 numbers (3) and those sum two 
numbers(2). These adder cell numbers represent the 
cells in the respective adders Dl-Dll as all being of the 
same type, which is a simplification . 

The primary difference between this and a standard 
approach is performance: the new circuit performs 
twice as many multiplies in the same amount of time. 

Use of FIG. 22-based circuitry enhances performance by 
permitting polynomial calculation step optimization. 
This represents a speedup of a factor of two in these 
calculations. 

This design implements p-7 p-adic double precision 
mantissa multiplication-accumulate targeting extended 
scientific notation acculators. 

Double length accumulators require 36 digit storage, 
which poses a problem: if the approach taken in new 
circuit l(simplicity of the alignment slots) were used 
here, it would require 64 alignment slots, resulting in 
64 digit accumulators. This is a lot more accuracy than 
would seem warranted. The assumptions made here are 
that there are 36 alignment slots, with 36 redundant 
p-adic digits required of each of the two accumulators. 
Each redundant p-adic digit will be assumed to require 
6 bits of memory. 

Note that complex number support would double the 
number of accumulators required. Such support is 
needed for FORTRAN and optimal for Digital Signal 
Processing applications based upon complex number 
arithmetic. 

It will be further assumed that each digit of the redundant 
p-adic adder cell is roughly equivalent to 3 of the 
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redundant binary adder cells. The Dumber of adder cells writing. The contents of block 342 are assumed to change 
is decomposed into two types: those which sum 3 only when Lhe memory circuit performs a read. Note that 
numbers (3) and those sum two numbers(2). These address generation and read/write control signal bundles are 
adder cell numbers represent the cells in the respective i e ft silent in this diagram to simplify the discussion, 
adders Dl-Oll as all being of the same type, which is 5 The three-port memory block 332 contains a write data 
a simplification. multiplexor block 344, labeled 'K7:3-port Write Mux' 
Since there is no known equivalent circuit, comparison is which selects between the input register signal bundles 
more hypothetical: this circuit's throughput is twice a 'irO-irn* and the output register signal bundles *orO-oro\ 
circuit lacking the El pipe registers. The selected single bundle is sent to the write port of the 
Use of FIG. 22-based circuitry enhances performance by 10 memory. The read ports send their signal bundles to a read 
permitting polynomial calculation step optimization. register block 346, labeled 'L9:3-port Rdl Reg' and a read 
This represents a speedup of a factor of two in these register block 348, labeled *L10:3-port Rd2 Reg', which 
calculations. synchronize these signals for use elsewhere. This memory 
Referring now to FIG. 23, there is illustrated a block 332 can perform two read and one write access in a clock 
diagram of a Multiplier Block with minimal support Cir- 15 cycle. The contents of 346 and 349 are assumed to change 
cuitry. A Multiplier-Accumulator Block 310 contains a only when the memory circuit performs a read. Note that 
multiplier-accumulator comprised of a multiplier 312 and an address generation and read/write control signal bundles are 
accumulator 314, as described hereinabove, phis an input left silent in this diagram to simplify the discussion, 
register block 316 labeled 'L^MuUnReg'. Signal bundles Referring now to FIG. 25, there is illustrated a block 
whose sources are external to this circuit are selected by a 20 diagram of a Multiplier-Accumulator with Multiplicity of 
plurality of multiplexors 318 labeled 'K2:IN Mux(s)\ The Adders, and one-port and three-port Memories. This circuit 
selected signal bundles are synchronously stored in the incorporates all the functional blocks of FIG. 24 plus one or 
memory of a block 320 labeled 'LI: IN Reg(s)'. The inputs more additional Adder blocks, each containing a multiplicity 
to the Multiplier- Accumulator block 310 are selected by a of Accumulators 350, labeled 'L7:AddAcc(s)\ Adder input 
multiplexor circuit 322 labeled *K3:Mult Mux(s) f . A plural- 25 multiplexing may be independently controlled to each Adder 
ity of signals bundles from block 322 would then be sent to Block. Multiple signal bundles (ac[l,0] to ac[pk]) are 
322 and to a block 324 labeled 'K4:Add Mux(s)\ assumed to be generated from these Adder Blocks. Any 
The K4 block selects between synchronized externally adder status signals, such as overflow, equality, zero detect, 
sourced signal bundles coming from the block 320 and the etc., are assumed synchronously stored and made available 
contents (or partial contents) of selected memory contents of 30 to the appropriate control signal generation circuitry. These 
the accumulator block 314 labeled 'L4:MulAcReg(s)*. status signal bundles, synchronizing circuitry and control 
These signal bundles are then synchronously stored in the signal generation circuitry are left silent in this figure for 
memory contents of a block 326, labeled 'L5:AddInReg' in reasons of simplicity. The Multiplier Multiplexor 332 is 
an Adder block 328. The Adder is considered to optionally extended to select any from the generated adder signal 
possess a mid-pipe register block labeled 'L6:AddMidReg 35 bundles (ac[l,0] to ac[p,k]). The Output Register Multi- 
is)'. Hie synchronous results of the Adder are stored in the plexor 334 is extended any from the generated adder signal 
memory components) of the block labeled 'L7:AddAccReg bundles (ac[l,0] to ac[p,k]). 

(s)\ In the simplest implementations, the following compo- The basic Advantages of Circuit represented by FIGS. 23 
nents would not be populated: K2, LI, K3, K4 and L6. to 25 will now be described. Circuitry based upon FIG. 23 
Referring now to FIG. 24, there is illustrated a block 40 incorporates the advantages of the implemented multiplier- 
diagram of a Multipher-Accumulator with Basic Core of accumulators based upon the embodiments described here- 
Adder, one-port and three-port Memories. This circuit incor- inabove. The major systems limitation regarding multipliers 
porates all the functional blocks of FIG. 23 7 plus a one-port is efficiently providing operands to the circuitry. The 
memory 330, similar to one -port memory 44, a three-port embodiment of FIG. 23 does not address this problem, 
memory 322, similar to three-port memory 43, output reg- 45 Circuitry based upon FIGS. 24 and 25 solves the systems 
ister multiplexors 334 and output registers 336. The Multi- limitation in FIG. 23 for a broad class of useful algorithms 
plier's input selector 322 now selects between signal which act upon a stream of data. A stream of data is 
bundles from the input register block 320 (Ll(irO-irn)), the characterized by a sequential transmission of data values. It 
memory read port synchronized signal bundles(mr0-mr2) possesses significant advantages in the ability to perform 
and the synchronized results of the output register block 336 50 linear transformations (which includes Fast Fourier 
(L7(orO-orn)). The Adder's accumulators L7 now serve as Transforms(FFTs), Finite Impulse Response (FIR) filters, 
the output registers, with the block 334 * K5:OutRegMux(s)' Discrete Cosine Transf orms(DCTs)) , convolutions and poly- 
selecting between adder result signal bundles), input reg- nomial calculations upon data streams. Linear Transforma- 
ister signal bundles (irO-irn) and memory read port signal tions are characterized as a square M by M matrix a times 
bundles (mr0-mr2). The Adder 328 may also possess status 55 a vector v generating a resultant vector. In the general case, 
signals, such as equality, zero-detect, overflow, cany out, each result to be output requires M multiplications of apj] 
. etc. which may also be registered. They are left silent in this with v[j] for j-0, . . . , M. The result may then be sent to one 
diagram to simplify the discussion. or more output registers where it may be written into either 
The one-port memory block 330 contains a write data of me memories. If me matrix is symmetric about the center, 
multiplexor block 340, labeled 'K6:l-port Write Mux' 60 so that afij]»a[i,n-j] or a[ijj»- a[un-j]> then an optimal 
which selects between the input register signal bundles sequencing involves adding or subtracting v(j] and v[n-j], 
*ir0-irn' and the output register signal bundles 'orO-orn'. followed by multiplying the result by a[ij], which is accu- 
The selected signal bundle is sent to the write port of the mutated in the multiplier's accumulators). This dataflow 
memory. The read port sends its signal bundle to a read reduces the execution time by a factor of two. Note that 
register 342, labeled *L8:l-port Read Reg', which synchro- 65 assuming the matrix a can be stored in the one port memory 
nizes these signals for use elsewhere. This memory can only and the vector v can be stored in the three port memory, the 
perform one access in a clock cycle, either reading or multiplier is essentially always busy. This system data flow 
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does not stall the multiplier. In fact, when the matrix is 
symmetric around the center, the throughput is twice as fast. 

Convolutions are characterized by acting upon a stream of 
data. Let x[-o], . . . , x[0], . . . , x[n] denote a stream centered 
at x[0]. A convolution is the sum c[0]* xf-n]* x[0]+. . . 
+c[n]*x[0]*x[n]. After calculating each convolution result, 
the data x(-n] is removed, the remaining data is "moved 
down" one element and a new piece of data becomes xfn]. 
Assuming that the x vector can be stored in the three-port 
memory, the acquiring of a new data element does not slow 
down the multiplier. The multiplier is essentially busy all the 
time. Polynomial calculations arc optimized inside the 
multiplier-accumulator architecturally. Assuming sufficient 
memory to hold the coefficients, these multiplier- 
accumulator calculations can be performed on every clock 
cycle. Large-word integer multiplications are also efficiently 
implemented with these circuitry of FIGS. 7 and 8. Let A[0] 
to A[n] be one large integer and B[0] to B[m] be a second 
large integer. The product is a number C[0] to C[n+m] which 
can be represented as: 

qO>Leasl Significant Word of A[0]*B[0], 

C[l)-A[1]*B[0]+A[0]*B[1]+Second word of qo]. . . 

qn+m]=A[n]*B[m]+Most Significant Word of C[n+m- 

i] 

These calculations can also be performed with very few lost 
cycles for the multiplier. Circuitry built around FIG. 25 has 
the advantage in that bounds checking (which requires at 
least two adders) can be done in a single cycle, and sym- 
metric Matrix Linear Transformations can simultaneously 
be adding or subtracting vector elements while another 
adder is converting the multiplier's accumulators). 

Although the preferred embodiment has been described in 
detail, it should be understood that various changes, substi- 
tutions and alterations can be made therein without depart- 
ing from the spirit and scope of the invention as defined by 
the appended claims. 

What is claimed is: 

1. A synchronous multiplier-accumulator comprising: 
a first pipeline stage including: small bit multipliers to 
generate partial products from arithmetic data signals 
and an adder network coupled to the small bit multi- 
pliers to receive and sum said partial products and 
wherein said small bit multipliers support processing of 
p-adic arithmetic data signals, where p is a prime 
number; said adder network comprising local carry 
propagate adder cells configured as a multi-level adder 
tree to generate the product of said arithmetic data 
signals at an output level of said adder tree; said first 
pipeline stage also including a first accumulator having 
a plurality of registers to store results from one level of 
said adder tree for input to the next level of said adder 
tree; said first pipeline stage being operable to generate 
and sum said partial products and to store said results 
in said first accumulator during one clock cycle; 
a second pipeline stage comprising a second accumulator 
having a plurality of registers to store results from a 
further adder comprising a plurality of local carry 
propagate adder cells; and an interface circuit coupled 
to the second accumulator to selectively access one or 
more stored results stored by said second accumulator; 
said output level of said adder tree coupled to input said 
product to said further adder; said second pipeline stage 
being operable during a clock cycle subsequent to said 
one clock cycle to selectively output one or more stored 
results from said second accumulator for output from 
said multiplier accumulator and/or for feedback to said 
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further adder, and to operate said further adder and said 
output level of said adder tree. 
2. A multiplier-accumulator according to claim 1, wherein 
said multiple level adder tree has either 3 or 4 levels. 
5 3. A multiplier-accumulator according to claim 1, wherein 
said second pipeline stage includes alignment circuitry to 
align said product of the arithmetic data signals from the 
adder tree with precision components of a result stored by 
the second accumulator, and wherein said feedback input is 
10 coupled by said alignment circuitry to the further adder. 

4. A multiplier-accumulator according to claim 1, wherein 
said subsequent clock cycle is next to said one clock cycle. 

5. A multiplier-accumulator according to claim 1, said 
adder tree comprises a uniform adder tree or a k-ary adder 

15 tree * 

6. A multiplier-accumulator according to claim 1, wherein 
p<-31. 

7. A multiplier-accumulator according to claim 1, wherein 
p-7 or p-31. 

^ 8. A multiplier-accumulator according to claim 1, wherein 
said small bit multipliers include an input multiplexer oper- 
able to selectively couple to said small bit multipliers, 
arithmetic data signals or the contents of registers of said 
second accumulator selected by said interface circuit. 
^ 9. A multiplier-accumulator according to claim 1, wherein 
said second pipeline stage includes at least one further 
second accumulator to store results from said further adder, 
and wherein said interface circuit is also coupled to access 
one or more stored results stored by said at least one further 
second accumulator. 

10. A synchronous multiplier-accumulator comprising: 
a first pipeline stage including: small bit multipliers to 
generate partial products from arithmetic data signals 
an adder network coupled to the small bit multipliers to 
35 receive and sum said partial products; said adder net- 
work comprising local carry propagate adder cells 
configured as a multi-level adder tree to generate the 
product of said arithmetic data signals at an output level 
of said adder tree; said first pipeline stage also includ- 
4Q ing a first accumulator having a plurality of registers to 
store results from one level of said adder tree for input 
to the next level of said adder tree; said first pipeline 
stage being operable to generate and sum said partial 
products and to store said results in said first accumu- 
45 lator during one clock cycle; 

a second pipeline stage comprising a second accumulator 
having a plurality of registers to store results from a 
further adder comprising a plurality of local carry 
propagate adder cells; and an interface circuit coupled 
50 to the second accumulator to selectively access one or 
more stored results stored by said second accumulator; 
said output level of said adder tree coupled to input said 
product to said further adder: 
said second pipeline stage being operable during a clock 
55 cycle subsequent to said one clock cycle to selectively 
output one or more stored results from said second 
accumulator for output from said multiplier accumula- 
tor and/or for feedback to said further adder and to 
operate said further adder and said output level of said 
60 adder tree; 

wherein said first accumulator is located between levels of 
said adder tree to provide approximately equivalent 
signal propagation delays from the multiplier input to 
the first accumulator, and from the first accumulator to 
65 the second accumulator. 

***** 



